NLP Chinese Word Segmentation Plugin
analysis-baidu-nlp is a Chinese word segmentation plugin developed independently by the Baidu AI Cloud Elasticsearch (ES) team. Its segmentation performance and accuracy are among the most advanced in the industry.
Background
analysis-baidu-nlp is based on the DeepCRF model developed independently by Baidu NLP. The model condenses more than ten years of Baidu's technology accumulation in the Chinese search field, and its performance and accuracy are industry-leading.
The plugin provides both basic-granularity and phrase-granularity segmentation results for different application requirements. The phrase granularity is produced by intelligently combining basic-granularity segments.
Note: The dictionary model is loaded into off-heap memory (outside the JVM heap) the first time it is used. We recommend nodes with at least 8 GB of memory.
Word Segmentation Granularity
analysis-baidu-nlp mainly provides Analyzers of two granularities:
- Basic granularity model (bd-nlp-basic)
- Phrase granularity model (bd-nlp-phrase)
Both Analyzers have a built-in lowercase filter and a stopwords filter, so they work out of the box.
The plugin also provides two Tokenizers with the same names:
- Basic granularity model (bd-nlp-basic)
- Phrase granularity model (bd-nlp-phrase)
The two Tokenizers produce only the raw segmentation results; users can add custom stopword filters or other, more complex filters according to their own application requirements.
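For example (a sketch; the index name, filter name, analyzer name, and stopword list below are illustrative, not part of the plugin), a custom analyzer can pair the bd-nlp-basic tokenizer with a standard stop filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["的", "了", "是"]
        }
      },
      "analyzer": {
        "my_nlp_analyzer": {
          "tokenizer": "bd-nlp-basic",
          "filter": ["my_stopwords"]
        }
      }
    }
  }
}
```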
Comparison with ik at Basic and Phrase Granularity
Comparison at basic granularity
Compare the basic (finest) granularity segmentation of "Maintenance fund":
- bd-nlp-basic segmentation
POST /_analyze
{
"text": " Maintenance fund ",
"analyzer": "bd-nlp-basic"
}
Word segmentation result:
{
"tokens": [
{
"token": " Maintenance ",
"start_offset": 0,
"end_offset": 2,
"type": "WORD",
"position": 0
},
{
"token": " Fund ",
"start_offset": 2,
"end_offset": 4,
"type": "WORD",
"position": 1
}
]
}
- ik_max_word segmentation
POST _analyze
{
"analyzer": "ik_max_word",
"text": " Maintenance fund "
}
Word segmentation result:
{
"tokens": [
{
"token": " Maintenance fund ",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
},
{
"token": " Maintenance ",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 1
},
{
"token": " Maintenance ",
"start_offset": 0,
"end_offset": 1,
"type": "CN_WORD",
"position": 2
},
{
"token": " Maintenance ",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 3
},
{
"token": " Fund ",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 4
},
{
"token": " Fund ",
"start_offset": 2,
"end_offset": 3,
"type": "CN_WORD",
"position": 5
},
{
"token": " Fund ",
"start_offset": 3,
"end_offset": 4,
"type": "CN_CHAR",
"position": 6
}
]
}
Comparison at phrase granularity
Compare phrase-granularity segmentation of "Qingming Festival is also known as Outing Festival, Xingqing Festival, March Festival, Ancestor Worship Festival, etc.":
- bd-nlp-phrase segmentation
POST /_analyze
{
"text": " Qingming Festival is also known as Outing Festival, Xingqing Festival, March Festival, Ancestor Worship Festival, etc. ",
"analyzer": "bd-nlp-phrase"
}
Phrase segmentation result:
{
"tokens": [
{
"token": " Qingming Festival ",
"start_offset": 0,
"end_offset": 3,
"type": "WORD",
"position": 0
},
{
"token": " Also known as ",
"start_offset": 4,
"end_offset": 6,
"type": "WORD",
"position": 2
},
{
"token": " Outing Festival ",
"start_offset": 6,
"end_offset": 9,
"type": "WORD",
"position": 3
},
{
"token": " Xingqing Festival ",
"start_offset": 10,
"end_offset": 13,
"type": "WORD",
"position": 5
},
{
"token": " March Festival ",
"start_offset": 14,
"end_offset": 17,
"type": "WORD",
"position": 7
},
{
"token": " Ancestor Worship ",
"start_offset": 18,
"end_offset": 20,
"type": "WORD",
"position": 9
},
{
"token": " Byte ",
"start_offset": 20,
"end_offset": 21,
"type": "WORD",
"position": 10
}
]
}
- ik_smart segmentation
POST _analyze
{
"analyzer": "ik_smart",
"text": " Qingming Festival is also known as Outing Festival, Xingqing Festival, March Festival, Ancestor Worship Festival, etc. "
}
Word segmentation result:
{
"tokens": [
{
"token": " Qingming Festival ",
"start_offset": 0,
"end_offset": 3,
"type": "CN_WORD",
"position": 0
},
{
"token": " Also known as ",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 1
},
{
"token": " Outing ",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 2
},
{
"token": " Byte ",
"start_offset": 8,
"end_offset": 9,
"type": "CN_WORD",
"position": 3
},
{
"token": " Row ",
"start_offset": 10,
"end_offset": 11,
"type": "CN_WORD",
"position": 4
},
{
"token": " Qing ",
"start_offset": 11,
"end_offset": 12,
"type": "CN_CHAR",
"position": 5
},
{
"token": " Byte ",
"start_offset": 12,
"end_offset": 13,
"type": "CN_WORD",
"position": 6
},
{
"token": " March ",
"start_offset": 14,
"end_offset": 16,
"type": "CN_WORD",
"position": 7
},
{
"token": " Byte ",
"start_offset": 16,
"end_offset": 17,
"type": "COUNT",
"position": 8
},
{
"token": " Ancestor Worship ",
"start_offset": 18,
"end_offset": 20,
"type": "CN_WORD",
"position": 9
},
{
"token": " Byte ",
"start_offset": 20,
"end_offset": 21,
"type": "CN_WORD",
"position": 10
}
]
}
Using the Analyze API
Basic granularity segmentation
POST /_analyze
{
"analyzer": "bd-nlp-basic",
"text": " Last year, we had a Fireside Competition with them. We won in the first round but was defeated in the second round and the third round. "
}
Word segmentation result:
{
"tokens": [
{
"token": " Last year ",
"start_offset": 0,
"end_offset": 2,
"type": "WORD",
"position": 0
},
{
"token": " We ",
"start_offset": 2,
"end_offset": 4,
"type": "WORD",
"position": 1
},
{
"token": " and ",
"start_offset": 4,
"end_offset": 5,
"type": "WORD",
"position": 2
},
{
"token": " Them ",
"start_offset": 5,
"end_offset": 7,
"type": "WORD",
"position": 3
},
{
"token": " Had ",
"start_offset": 7,
"end_offset": 9,
"type": "WORD",
"position": 4
},
{
"token": " Fireside ",
"start_offset": 10,
"end_offset": 12,
"type": "WORD",
"position": 6
},
{
"token": " Competition ",
"start_offset": 12,
"end_offset": 14,
"type": "WORD",
"position": 7
},
{
"token": " The first ",
"start_offset": 15,
"end_offset": 17,
"type": "WORD",
"position": 9
},
{
"token": " Round ",
"start_offset": 17,
"end_offset": 19,
"type": "WORD",
"position": 10
},
{
"token": " Won ",
"start_offset": 19,
"end_offset": 20,
"type": "WORD",
"position": 11
},
{
"token": " The second ",
"start_offset": 22,
"end_offset": 24,
"type": "WORD",
"position": 14
},
{
"token": " Round ",
"start_offset": 24,
"end_offset": 26,
"type": "WORD",
"position": 15
},
{
"token": " and ",
"start_offset": 26,
"end_offset": 27,
"type": "WORD",
"position": 16
},
{
"token": " The third ",
"start_offset": 27,
"end_offset": 29,
"type": "WORD",
"position": 17
},
{
"token": " Round ",
"start_offset": 29,
"end_offset": 31,
"type": "WORD",
"position": 18
},
{
"token": " Failure ",
"start_offset": 32,
"end_offset": 33,
"type": "WORD",
"position": 20
},
{
"token": " Failure ",
"start_offset": 33,
"end_offset": 34,
"type": "WORD",
"position": 21
},
{
"token": " Failure ",
"start_offset": 34,
"end_offset": 35,
"type": "WORD",
"position": 22
},
{
"token": " Failure",
"start_offset": 35,
"end_offset": 36,
"type": "WORD",
"position": 23
}
]
}
Phrase granularity segmentation
POST /_analyze
{
"analyzer": "bd-nlp-phrase",
"text": " Last year, we had a Fireside Competition with them. We won in the first round but was defeated in the second round and the third round. 。"
}
Word segmentation result:
{
"tokens": [
{
"token": " Last year ",
"start_offset": 0,
"end_offset": 2,
"type": "WORD",
"position": 0
},
{
"token": " We ",
"start_offset": 2,
"end_offset": 4,
"type": "WORD",
"position": 1
},
{
"token": " and ",
"start_offset": 4,
"end_offset": 5,
"type": "WORD",
"position": 2
},
{
"token": " Them ",
"start_offset": 5,
"end_offset": 7,
"type": "WORD",
"position": 3
},
{
"token": " Had ",
"start_offset": 7,
"end_offset": 9,
"type": "WORD",
"position": 4
},
{
"token": " Fireside Competition ",
"start_offset": 10,
"end_offset": 14,
"type": "WORD",
"position": 6
},
{
"token": " The first round ",
"start_offset": 15,
"end_offset": 19,
"type": "WORD",
"position": 8
},
{
"token": " Won ",
"start_offset": 19,
"end_offset": 20,
"type": "WORD",
"position": 9
},
{
"token": " The second round ",
"start_offset": 22,
"end_offset": 26,
"type": "WORD",
"position": 12
},
{
"token": " and ",
"start_offset": 26,
"end_offset": 27,
"type": "WORD",
"position": 13
},
{
"token": " The third ",
"start_offset": 27,
"end_offset": 29,
"type": "WORD",
"position": 14
},
{
"token": " Round ",
"start_offset": 29,
"end_offset": 31,
"type": "WORD",
"position": 15
},
{
"token": " Failure ",
"start_offset": 32,
"end_offset": 33,
"type": "WORD",
"position": 17
},
{
"token": " Failure ",
"start_offset": 33,
"end_offset": 34,
"type": "WORD",
"position": 18
},
{
"token": " Failure ",
"start_offset": 34,
"end_offset": 35,
"type": "WORD",
"position": 19
},
{
"token": " Failure ",
"start_offset": 35,
"end_offset": 36,
"type": "WORD",
"position": 20
}
]
}
Assigning an Analyzer to an Index
PUT test
{
"mappings": {
"doc": {
"properties": {
"k1": {
"type": "text",
"analyzer": "bd-nlp-basic" // Use the basic granularity model
},
"k2": {
"type": "text",
"analyzer": "bd-nlp-phrase" // Use the phrase granularity model
}
}
}
},
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "0"
}
}
}
Assigning a Tokenizer to an Index
PUT /test
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"tokenizer":"bd-nlp-basic", // Customize an analyzer
"filter":[
"lowercase" // Add filters required by the application
]
}
}
}
},
"mappings":{
"properties":{
"k2":{
"type":"text",
"analyzer":"my_analyzer" // Apply the custom analyzer to the corresponding field
}
}
}
}
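Once the index exists, the custom analyzer can be verified with the Analyze API, either by analyzer name or via the field it is mapped to (the sample text here is illustrative):

```json
POST /test/_analyze
{
  "field": "k2",
  "text": "Maintenance fund"
}
```

Because k2 is mapped to my_analyzer, this request runs the bd-nlp-basic tokenizer followed by the lowercase filter.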
Accuracy and Recall
Test results on a large dataset at Baidu:
Model | Accuracy rate | Recall rate | F value |
---|---|---|---|
analysis-baidu-nlp | 98.8% | 98.9% | 98.8% |
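The F value in the table is consistent with the harmonic mean of accuracy (precision) and recall; a quick check (assuming "F value" here means the F1 score):

```python
def f_value(precision: float, recall: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the table row: 98.8% accuracy and 98.9% recall
# give an F value of roughly 98.8%.
print(round(f_value(0.988, 0.989), 3))  # 0.988
```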