
          NLP Chinese Word Segmentation Plugin

          analysis-baidu-nlp is a Chinese word segmentation plugin developed independently by the Baidu AI Cloud Elasticsearch (ES) team. Its segmentation performance and accuracy are at an advanced level in the industry.

          Background

          analysis-baidu-nlp is built on the DeepCRF model developed independently by Baidu NLP. The model condenses more than ten years of Baidu's technology accumulation in the Chinese search field, and its performance and accuracy are industry-leading.

          It provides both basic granularity and phrase granularity segmentation results for different application requirements. The phrase granularity result is an intelligent combination of the basic granularity segments.

          Note: The dictionary model is loaded into memory outside the JVM heap the first time it is used. We recommend nodes with at least 8 GB of memory.

          Word Segmentation Granularity

          analysis-baidu-nlp mainly provides analyzers of two granularities:

          1. Basic granularity analyzer (bd-nlp-basic)
          2. Phrase granularity analyzer (bd-nlp-phrase)

          Both analyzers have a case-folding filter and a stopwords filter built in, so they work out of the box.

          The plugin also provides two tokenizers with the same names:

          1. Basic granularity tokenizer (bd-nlp-basic)
          2. Phrase granularity tokenizer (bd-nlp-phrase)

          The two tokenizers only produce the raw word segmentation results. Users can add a custom stopwords filter and other, more complex filters on top of them according to their own application requirements, for example as sketched below.
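
          The following sketch shows one way to do this; the index name my_index, the filter name my_stop, and the stopword list are illustrative, not part of the plugin. It wraps the bd-nlp-basic tokenizer with Elasticsearch's standard stop filter:

          PUT /my_index
          {
              "settings": {
                  "analysis": {
                      "filter": {
                          "my_stop": {                        // hypothetical custom stopwords filter
                              "type": "stop",
                              "stopwords": ["的", "了", "等"]  // illustrative stopword list
                          }
                      },
                      "analyzer": {
                          "my_basic_analyzer": {              // custom analyzer built on the plugin tokenizer
                              "tokenizer": "bd-nlp-basic",
                              "filter": ["lowercase", "my_stop"]
                          }
                      }
                  }
              }
          }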

          Comparison with ik at Basic Granularity and Phrase Granularity

          Comparison in basic granularity

          Compare the basic (fine-grained) segmentation of "维修基金" ("maintenance fund"):

          • bd-nlp-basic word segmentation
          POST /_analyze
          {
              "text": "维修基金",
              "analyzer": "bd-nlp-basic"
          }

          Word segmentation result:

          {
             "tokens": [
                {
                   "token": "维修",
                   "start_offset": 0,
                   "end_offset": 2,
                   "type": "WORD",
                   "position": 0
                },
                {
                   "token": "基金",
                   "start_offset": 2,
                   "end_offset": 4,
                   "type": "WORD",
                   "position": 1
                }
             ]
          }
          • ik_max_word word segmentation
          POST /_analyze
          {
              "analyzer": "ik_max_word",
              "text": "维修基金"
          }

          Word segmentation result:

          {
             "tokens": [
                {
                   "token": "维修基金",
                   "start_offset": 0,
                   "end_offset": 4,
                   "type": "CN_WORD",
                   "position": 0
                },
                {
                   "token": "维修",
                   "start_offset": 0,
                   "end_offset": 2,
                   "type": "CN_WORD",
                   "position": 1
                },
                {
                   "token": "维",
                   "start_offset": 0,
                   "end_offset": 1,
                   "type": "CN_WORD",
                   "position": 2
                },
                {
                   "token": "修",
                   "start_offset": 1,
                   "end_offset": 2,
                   "type": "CN_CHAR",
                   "position": 3
                },
                {
                   "token": "基金",
                   "start_offset": 2,
                   "end_offset": 4,
                   "type": "CN_WORD",
                   "position": 4
                },
                {
                   "token": "基",
                   "start_offset": 2,
                   "end_offset": 3,
                   "type": "CN_WORD",
                   "position": 5
                },
                {
                   "token": "金",
                   "start_offset": 3,
                   "end_offset": 4,
                   "type": "CN_CHAR",
                   "position": 6
                }
             ]
          }

          Comparison in phrase granularity

          Compare the phrase granularity segmentation of "清明节,又称踏青节、行清节、三月节、祭祖节等。" ("Qingming Festival, also known as Outing Festival, Xingqing Festival, March Festival, Ancestor Worship Festival, etc."):

          • bd-nlp-phrase word segmentation
          POST /_analyze
          {
              "text": "清明节,又称踏青节、行清节、三月节、祭祖节等。",
              "analyzer": "bd-nlp-phrase"
          }

          Phrase segmentation result:

          {
             "tokens": [
                {
                   "token": "清明节",
                   "start_offset": 0,
                   "end_offset": 3,
                   "type": "WORD",
                   "position": 0
                },
                {
                   "token": "又称",
                   "start_offset": 4,
                   "end_offset": 6,
                   "type": "WORD",
                   "position": 2
                },
                {
                   "token": "踏青节",
                   "start_offset": 6,
                   "end_offset": 9,
                   "type": "WORD",
                   "position": 3
                },
                {
                   "token": "行清节",
                   "start_offset": 10,
                   "end_offset": 13,
                   "type": "WORD",
                   "position": 5
                },
                {
                   "token": "三月节",
                   "start_offset": 14,
                   "end_offset": 17,
                   "type": "WORD",
                   "position": 7
                },
                {
                   "token": "祭祖",
                   "start_offset": 18,
                   "end_offset": 20,
                   "type": "WORD",
                   "position": 9
                },
                {
                   "token": "节",
                   "start_offset": 20,
                   "end_offset": 21,
                   "type": "WORD",
                   "position": 10
                }
             ]
          }
          • ik_smart word segmentation
          POST /_analyze
          {
              "analyzer": "ik_smart",
              "text": "清明节,又称踏青节、行清节、三月节、祭祖节等。"
          }

          Word segmentation result:

          {
             "tokens": [
                {
                   "token": "清明节",
                   "start_offset": 0,
                   "end_offset": 3,
                   "type": "CN_WORD",
                   "position": 0
                },
                {
                   "token": "又称",
                   "start_offset": 4,
                   "end_offset": 6,
                   "type": "CN_WORD",
                   "position": 1
                },
                {
                   "token": "踏青",
                   "start_offset": 6,
                   "end_offset": 8,
                   "type": "CN_WORD",
                   "position": 2
                },
                {
                   "token": "节",
                   "start_offset": 8,
                   "end_offset": 9,
                   "type": "CN_WORD",
                   "position": 3
                },
                {
                   "token": "行",
                   "start_offset": 10,
                   "end_offset": 11,
                   "type": "CN_WORD",
                   "position": 4
                },
                {
                   "token": "清",
                   "start_offset": 11,
                   "end_offset": 12,
                   "type": "CN_CHAR",
                   "position": 5
                },
                {
                   "token": "节",
                   "start_offset": 12,
                   "end_offset": 13,
                   "type": "CN_WORD",
                   "position": 6
                },
                {
                   "token": "三月",
                   "start_offset": 14,
                   "end_offset": 16,
                   "type": "CN_WORD",
                   "position": 7
                },
                {
                   "token": "节",
                   "start_offset": 16,
                   "end_offset": 17,
                   "type": "COUNT",
                   "position": 8
                },
                {
                   "token": "祭祖",
                   "start_offset": 18,
                   "end_offset": 20,
                   "type": "CN_WORD",
                   "position": 9
                },
                {
                   "token": "节",
                   "start_offset": 20,
                   "end_offset": 21,
                   "type": "CN_WORD",
                   "position": 10
                }
             ]
          }

          Using the Analyze API

          Basic granularity word segmentation

          Note: This example analyzes a Chinese sentence in the original documentation; the offsets below count its Chinese characters, while the tokens are shown here as English glosses.

          POST /_analyze
          {
             "analyzer": "bd-nlp-basic",
             "text": "Last year, we had a fireside competition with them. We won in the first round but were defeated in the second round and the third round."
          }

          Word segmentation result:

          {
            "tokens": [
               {
                  "token": "last year",
                  "start_offset": 0,
                  "end_offset": 2,
                  "type": "WORD",
                  "position": 0
               },
               {
                  "token": "we",
                  "start_offset": 2,
                  "end_offset": 4,
                  "type": "WORD",
                  "position": 1
               },
               {
                  "token": "and",
                  "start_offset": 4,
                  "end_offset": 5,
                  "type": "WORD",
                  "position": 2
               },
               {
                  "token": "them",
                  "start_offset": 5,
                  "end_offset": 7,
                  "type": "WORD",
                  "position": 3
               },
               {
                  "token": "had",
                  "start_offset": 7,
                  "end_offset": 9,
                  "type": "WORD",
                  "position": 4
               },
               {
                  "token": "fireside",
                  "start_offset": 10,
                  "end_offset": 12,
                  "type": "WORD",
                  "position": 6
               },
               {
                  "token": "competition",
                  "start_offset": 12,
                  "end_offset": 14,
                  "type": "WORD",
                  "position": 7
               },
               {
                  "token": "the first",
                  "start_offset": 15,
                  "end_offset": 17,
                  "type": "WORD",
                  "position": 9
               },
               {
                  "token": "round",
                  "start_offset": 17,
                  "end_offset": 19,
                  "type": "WORD",
                  "position": 10
               },
               {
                  "token": "won",
                  "start_offset": 19,
                  "end_offset": 20,
                  "type": "WORD",
                  "position": 11
               },
               {
                  "token": "the second",
                  "start_offset": 22,
                  "end_offset": 24,
                  "type": "WORD",
                  "position": 14
               },
               {
                  "token": "round",
                  "start_offset": 24,
                  "end_offset": 26,
                  "type": "WORD",
                  "position": 15
               },
               {
                  "token": "and",
                  "start_offset": 26,
                  "end_offset": 27,
                  "type": "WORD",
                  "position": 16
               },
               {
                  "token": "the third",
                  "start_offset": 27,
                  "end_offset": 29,
                  "type": "WORD",
                  "position": 17
               },
               {
                  "token": "round",
                  "start_offset": 29,
                  "end_offset": 31,
                  "type": "WORD",
                  "position": 18
               },
               {
                  "token": "failure",
                  "start_offset": 32,
                  "end_offset": 33,
                  "type": "WORD",
                  "position": 20
               },
               {
                  "token": "failure",
                  "start_offset": 33,
                  "end_offset": 34,
                  "type": "WORD",
                  "position": 21
               },
               {
                  "token": "failure",
                  "start_offset": 34,
                  "end_offset": 35,
                  "type": "WORD",
                  "position": 22
               },
               {
                  "token": "failure",
                  "start_offset": 35,
                  "end_offset": 36,
                  "type": "WORD",
                  "position": 23
               }
            ]
          }

          Phrase granularity word segmentation

          POST /_analyze
          {
             "analyzer": "bd-nlp-phrase",
             "text": "Last year, we had a fireside competition with them. We won in the first round but were defeated in the second round and the third round."
          }

          Word segmentation result:

          {
            "tokens": [
               {
                  "token": "last year",
                  "start_offset": 0,
                  "end_offset": 2,
                  "type": "WORD",
                  "position": 0
               },
               {
                  "token": "we",
                  "start_offset": 2,
                  "end_offset": 4,
                  "type": "WORD",
                  "position": 1
               },
               {
                  "token": "and",
                  "start_offset": 4,
                  "end_offset": 5,
                  "type": "WORD",
                  "position": 2
               },
               {
                  "token": "them",
                  "start_offset": 5,
                  "end_offset": 7,
                  "type": "WORD",
                  "position": 3
               },
               {
                  "token": "had",
                  "start_offset": 7,
                  "end_offset": 9,
                  "type": "WORD",
                  "position": 4
               },
               {
                  "token": "fireside competition",
                  "start_offset": 10,
                  "end_offset": 14,
                  "type": "WORD",
                  "position": 6
               },
               {
                  "token": "the first round",
                  "start_offset": 15,
                  "end_offset": 19,
                  "type": "WORD",
                  "position": 8
               },
               {
                  "token": "won",
                  "start_offset": 19,
                  "end_offset": 20,
                  "type": "WORD",
                  "position": 9
               },
               {
                  "token": "the second round",
                  "start_offset": 22,
                  "end_offset": 26,
                  "type": "WORD",
                  "position": 12
               },
               {
                  "token": "and",
                  "start_offset": 26,
                  "end_offset": 27,
                  "type": "WORD",
                  "position": 13
               },
               {
                  "token": "the third",
                  "start_offset": 27,
                  "end_offset": 29,
                  "type": "WORD",
                  "position": 14
               },
               {
                  "token": "round",
                  "start_offset": 29,
                  "end_offset": 31,
                  "type": "WORD",
                  "position": 15
               },
               {
                  "token": "failure",
                  "start_offset": 32,
                  "end_offset": 33,
                  "type": "WORD",
                  "position": 17
               },
               {
                  "token": "failure",
                  "start_offset": 33,
                  "end_offset": 34,
                  "type": "WORD",
                  "position": 18
               },
               {
                  "token": "failure",
                  "start_offset": 34,
                  "end_offset": 35,
                  "type": "WORD",
                  "position": 19
               },
               {
                  "token": "failure",
                  "start_offset": 35,
                  "end_offset": 36,
                  "type": "WORD",
                  "position": 20
               }
            ]
          }

          Specifying an Analyzer for an Index

          PUT test 
          { 
             "mappings": { 
                "doc": { 
                   "properties": { 
                      "k1": { 
                         "type": "text", 
                         "analyzer": "bd-nlp-basic" // Use the basic granularity model 
                      }, 
                      "k2": { 
                         "type": "text", 
                         "analyzer": "bd-nlp-phrase" // Use the phrase granularity model 
                      } 
                   } 
                } 
             }, 
             "settings": { 
                "index": { 
                   "number_of_shards": "1",
                   "number_of_replicas": "0"
                } 
             } 
          } 
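
          To confirm which analyzer a field actually uses, you can point the _analyze API at a mapped field (the sample text is illustrative):

          GET /test/_analyze
          {
             "field": "k1",        // analyzes with the analyzer mapped on k1, i.e. bd-nlp-basic
             "text": "维修基金"
          }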

          Specifying a Tokenizer for an Index

          PUT /test 
          { 
              "settings":{ 
                  "analysis":{ 
                      "analyzer":{ 
                          "my_analyzer":{ 
                              "tokenizer":"bd-nlp-basic",   // Customize an analyzer 
                              "filter":[ 
                                  "lowercase"               // Add filters required by the application 
                              ] 
                          } 
                      } 
                  } 
              }, 
              "mappings":{ 
                  "properties":{ 
                      "k2":{ 
                          "type":"text", 
                          "analyzer":"my_analyzer"         // Apply the custom analyzer to the corresponding field 
                      } 
                  } 
              } 
          }
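
          A custom analyzer defined in the index settings can be tested against that index in the same way (the sample text is illustrative):

          GET /test/_analyze
          {
              "analyzer": "my_analyzer",
              "text": "维修基金 ABC"
          }

          Because the filter chain includes lowercase, Latin-script tokens such as "ABC" are emitted in lower case.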

          Accuracy Rate and Recall Rate

          Test results on a large data set inside Baidu:

          Model                 Accuracy rate    Recall rate    F-score
          analysis-baidu-nlp    98.8%            98.9%          98.8%