NLP Chinese Word Segmentation Plugin
analysis-baidu-nlp is a Chinese word segmentation plugin developed independently by the Baidu AI Cloud Elasticsearch (ES) team. Its segmentation performance and accuracy are among the most advanced in the industry.
Background
analysis-baidu-nlp is based on the DeepCRF model developed independently by Baidu NLP. The model condenses more than ten years of Baidu's technology accumulation in the Chinese search field, and its performance and accuracy are industry-leading.
The plugin provides both basic-granularity and phrase-granularity segmentation results for different application requirements. The phrase granularity is produced by intelligently combining basic-granularity segments.
Note: The dictionary model is loaded into off-heap memory (outside the JVM heap) the first time it is used. We recommend nodes with at least 8 GB of memory.
Word Segmentation Granularity
analysis-baidu-nlp mainly provides Analyzers of two granularities:
- Basic granularity model (bd-nlp-basic)
- Phrase granularity model (bd-nlp-phrase)
Both Analyzers have a built-in lowercase filter and a stopwords filter, so they work out of the box.
The plugin also provides two Tokenizers with the same names:
- Basic granularity model (bd-nlp-basic)
- Phrase granularity model (bd-nlp-phrase)
The two Tokenizers produce only the raw segmentation results; users can add custom stopword filters or other, more complex filters according to their own application requirements.
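For example (a sketch; the index name, filter name, analyzer name, and stopword list below are illustrative, not part of the plugin), a custom analyzer can pair the bd-nlp-basic tokenizer with a standard stop filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["的", "了", "是"]
        }
      },
      "analyzer": {
        "my_nlp_analyzer": {
          "tokenizer": "bd-nlp-basic",
          "filter": ["my_stopwords"]
        }
      }
    }
  }
}
```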
Comparison with ik at Basic and Phrase Granularity
Comparison at basic granularity
Compare the basic (finest) granularity segmentation of "Maintenance fund":
- bd-nlp-basic segmentation
POST /_analyze
{
"text": " Maintenance fund ",
"analyzer": "bd-nlp-basic"
}
Word segmentation result:
{
"tokens": [
{
"token": " Maintenance ",
"start_offset": 0,
"end_offset": 2,
"type": "WORD",
"position": 0
},
{
"token": " Fund ",
"start_offset": 2,
"end_offset": 4,
"type": "WORD",
"position": 1
}
]
}
- ik_max_word segmentation
POST _analyze
{
"analyzer": "ik_max_word",
"text": " Maintenance fund "
}
Word segmentation result:
{
"tokens": [
{
"token": " Maintenance fund ",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": " Maintenance ",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": " Maintenance ",
"start_offset": 0,
"end_offset": 1,
"type": "CN_WORD",
"position": 2
},
{
"token": " Maintenance ",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 3
},
{
"token": " Fund ",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 4
},
{
"token": " Fund ",
"start_offset": 2,
"end_offset": 3,
"type": "CN_WORD",
"position": 5
},
{
"token": " Fund ",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 6
}
]
}
Comparison at phrase granularity
Compare phrase-granularity segmentation of "Qingming Festival is also known as Outing Festival, Xingqing Festival, March Festival, Ancestor Worship Festival, etc.":
- bd-nlp-phrase segmentation
POST /_analyze
{
"text": " Qingming Festival is also known as Outing Festival, Xingqing Festival, March Festival, Ancestor Worship Festival, etc. ",
"analyzer": "bd-nlp-phrase"
}
Phrase segmentation result:
{
"tokens": [
{
"token": " Qingming Festival ",
"start_offset": 0,
"end_offset": 3,
"type": "WORD",
"position": 0
},
{
"token": " Also known as ",
"start_offset": 4,
"end_offset": 6,
"type": "WORD",
"position": 2
},
{
"token": " Outing Festival ",
"start_offset": 6,
"end_offset": 9,
"type": "WORD",
"position": 3
},
{
"token": " Xingqing Festival ",
"start_offset": 10,
"end_offset": 13,
"type": "WORD",
"position": 5
},
{
"token": " March Festival ",
"start_offset": 14,
"end_offset": 17,
"type": "WORD",
"position": 7
},
{
"token": " Ancestor Worship ",
"start_offset": 18,
"end_offset": 20,
"type": "WORD",
"position": 9
},
{
"token": " Byte ",
"start_offset": 20,
"end_offset": 21,
"type": "WORD",
"position": 10
}
]
}
- ik_smart segmentation
POST _analyze
{
"analyzer": "ik_smart",
"text": " Qingming Festival is also known as Outing Festival, Xingqing Festival, March Festival, Ancestor Worship Festival, etc. "
}
Word segmentation result:
{
"tokens": [
{
"token": " Qingming Festival ",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": " Also known as ",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": " Outing ",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 2
},
{
"token": " Byte ",
"start_offset": 8,
"end_offset": 9,
"type": "CN_WORD",
"position": 3
},
{
"token": " Row ",
"start_offset": 10,
"end_offset": 11,
"type": "CN_WORD",
"position": 4
},
{
"token": " Qing ",
"start_offset": 11,
"end_offset": 12,
"type": "CN_CHAR",
"position": 5
},
{
"token": " Byte ",
"start_offset": 12,
"end_offset": 13,
"type": "CN_WORD",
"position": 6
},
{
"token": " March ",
"start_offset": 14,
"end_offset": 16,
"type": "CN_WORD",
"position": 7
},
{
"token": " Byte ",
"start_offset": 16,
"end_offset": 17,
"type": "COUNT",
"position": 8
},
{
"token": " Ancestor Worship ",
"start_offset": 18,
"end_offset": 20,
"type": "CN_WORD",
"position": 9
},
{
"token": " Byte ",
"start_offset": 20,
"end_offset": 21,
"type": "CN_WORD",
"position": 10
}
]
}
Using the Analyze API
Basic granularity segmentation
POST /_analyze
{
"analyzer": "bd-nlp-basic",
"text": " Last year, we had a Fireside Competition with them. We won in the first round but was defeated in the second round and the third round. "
}
Word segmentation result:
{
"tokens": [
{
"token": " Last year ",
"start_offset": 0,
"end_offset": 2,
"type": "WORD",
"position": 0
},
{
"token": " We ",
"start_offset": 2,
"end_offset": 4,
"type": "WORD",
"position": 1
},
{
"token": " and ",
"start_offset": 4,
"end_offset": 5,
"type": "WORD",
"position": 2
},
{
"token": " Them ",
"start_offset": 5,
"end_offset": 7,
"type": "WORD",
"position": 3
},
{
"token": " Had ",
"start_offset": 7,
"end_offset": 9,
"type": "WORD",
"position": 4
},
{
"token": " Fireside ",
"start_offset": 10,
"end_offset": 12,
"type": "WORD",
"position": 6
},
{
"token": " Competition ",
"start_offset": 12,
"end_offset": 14,
"type": "WORD",
"position": 7
},
{
"token": " The first ",
"start_offset": 15,
"end_offset": 17,
"type": "WORD",
"position": 9
},
{
"token": " Round ",
"start_offset": 17,
"end_offset": 19,
"type": "WORD",
"position": 10
},
{
"token": " Won ",
"start_offset": 19,
"end_offset": 20,
"type": "WORD",
"position": 11
},
{
"token": " The second ",
"start_offset": 22,
"end_offset": 24,
"type": "WORD",
"position": 14
},
{
"token": " Round ",
"start_offset": 24,
"end_offset": 26,
"type": "WORD",
"position": 15
},
{
"token": " and ",
"start_offset": 26,
"end_offset": 27,
"type": "WORD",
"position": 16
},
{
"token": " The third ",
"start_offset": 27,
"end_offset": 29,
"type": "WORD",
"position": 17
},
{
"token": " Round ",
"start_offset": 29,
"end_offset": 31,
"type": "WORD",
"position": 18
},
{
"token": " Failure ",
"start_offset": 32,
"end_offset": 33,
"type": "WORD",
"position": 20
},
{
"token": " Failure ",
"start_offset": 33,
"end_offset": 34,
"type": "WORD",
"position": 21
},
{
"token": " Failure ",
"start_offset": 34,
"end_offset": 35,
"type": "WORD",
"position": 22
},
{
"token": " Failure",
"start_offset": 35,
"end_offset": 36,
"type": "WORD",
"position": 23
}
]
}
Phrase granularity segmentation
POST /_analyze
{
"analyzer": "bd-nlp-phrase",
"text": " Last year, we had a Fireside Competition with them. We won in the first round but was defeated in the second round and the third round. 。"
}
Word segmentation result:
{
"tokens": [
{
"token": " Last year ",
"start_offset": 0,
"end_offset": 2,
"type": "WORD",
"position": 0
},
{
"token": " We ",
"start_offset": 2,
"end_offset": 4,
"type": "WORD",
"position": 1
},
{
"token": " and ",
"start_offset": 4,
"end_offset": 5,
"type": "WORD",
"position": 2
},
{
"token": " Them ",
"start_offset": 5,
"end_offset": 7,
"type": "WORD",
"position": 3
},
{
"token": " Had ",
"start_offset": 7,
"end_offset": 9,
"type": "WORD",
"position": 4
},
{
"token": " Fireside Competition ",
"start_offset": 10,
"end_offset": 14,
"type": "WORD",
"position": 6
},
{
"token": " The first round ",
"start_offset": 15,
"end_offset": 19,
"type": "WORD",
"position": 8
},
{
"token": " Won ",
"start_offset": 19,
"end_offset": 20,
"type": "WORD",
"position": 9
},
{
"token": " The second round ",
"start_offset": 22,
"end_offset": 26,
"type": "WORD",
"position": 12
},
{
"token": " and ",
"start_offset": 26,
"end_offset": 27,
"type": "WORD",
"position": 13
},
{
"token": " The third ",
"start_offset": 27,
"end_offset": 29,
"type": "WORD",
"position": 14
},
{
"token": " Round ",
"start_offset": 29,
"end_offset": 31,
"type": "WORD",
"position": 15
},
{
"token": " Failure ",
"start_offset": 32,
"end_offset": 33,
"type": "WORD",
"position": 17
},
{
"token": " Failure ",
"start_offset": 33,
"end_offset": 34,
"type": "WORD",
"position": 18
},
{
"token": " Failure ",
"start_offset": 34,
"end_offset": 35,
"type": "WORD",
"position": 19
},
{
"token": " Failure ",
"start_offset": 35,
"end_offset": 36,
"type": "WORD",
"position": 20
}
]
}
Assigning an Analyzer to an Index
PUT test
{
"mappings": {
"doc": {
"properties": {
"k1": {
"type": "text",
"analyzer": "bd-nlp-basic" // Use the basic granularity model
},
"k2": {
"type": "text",
"analyzer": "bd-nlp-phrase" // Use the phrase granularity model
}
}
}
},
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "0"
}
}
}
Assigning a Tokenizer to an Index
PUT /test
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"bd-nlp-basic", // Customize an analyzer
"filter":[
"lowercase" // Add filters required by the application
]
}
}
}
},
"mappings":{
"properties":{
"k2":{
"type":"text",
"analyzer":"my_analyzer" // Apply the custom analyzer to the corresponding field
}
}
}
}
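Once the index exists, the custom analyzer can be verified with the Analyze API, either by analyzer name or via the field it is mapped to (the sample text here is illustrative):

```json
POST /test/_analyze
{
  "field": "k2",
  "text": "Maintenance fund"
}
```

Because k2 is mapped to my_analyzer, this request runs the bd-nlp-basic tokenizer followed by the lowercase filter.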
Accuracy and Recall
Test results on a large dataset at Baidu:
Model | Accuracy rate | Recall rate | F value |
---|---|---|---|
analysis-baidu-nlp | 98.8% | 98.9% | 98.8% |
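The F value in the table is consistent with the harmonic mean of accuracy (precision) and recall; a quick check (assuming "F value" here means the F1 score):

```python
def f_value(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the table row: 98.8% accuracy and 98.9% recall
# give an F value of roughly 98.8%.
print(round(f_value(0.988, 0.989), 3))  # 0.988
```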