
          Configure the IK Analyzer Dictionary

          Configure the dictionary

          The user can define a custom dictionary when the built-in IK dictionary does not meet the requirements. The configuration steps are as follows:

          • The user puts the edited dictionary files on an HTTP server.
          • The user configures the HTTP addresses of the IK dictionaries in ES. For example, if Baidu AI Cloud's custom word file is "baidu.dic" and the stop word file is "baidu_stop.dic", send the following command to ES (a way to confirm that the setting took effect is sketched after this list):

              PUT /_cluster/settings
              {
                  "persistent": {
                      "bpack.ik_analyzer.remote_ext_dict":"http://ip:port/baidu.dic",
                      "bpack.ik_analyzer.remote_ext_stopwords":"http://ip:port/baidu_stop.dic"
                  }
              }
          • "Es" checks whether the thesaurus file directed by the "http url" in the setting changes every 60s. If so, "es" automatically downloads the file and load it into the "ik".

          Verify that the dictionary takes effect

          After configuration, the user can verify that the dictionary has taken effect through the POST /_analyze API. For example:

          • Before configuring the dictionary, send the command:
              POST /_analyze
              {
              	 "analyzer" : "ik_smart",
              	 "text" : ["Zhao Xiaomingming is so handsome"]
              }

          The result returned by ES is as follows:

             {
                "tokens": [
                   {
                      "token": "Zhao",
                      "start_offset": 0,
                      "end_offset": 1,
                      "type": "CN_WORD",
                      "position": 0
                   },
                   {
                      "token": "Xiaoming",
                      "start_offset": 1,
                      "end_offset": 3,
                      "type": "CN_WORD",
                      "position": 1
                   },
                   {
                      "token": "Ming",
                      "start_offset": 3,
                      "end_offset": 4,
                      "type": "CN_WORD",
                      "position": 2
                   },
                   {
                      "token": "So handsome"",
                      "start_offset": 4,
                      "end_offset": 6,
                      "type": "CN_WORD",
                      "position": 3
                   }
                ]
             }
          • Then configure the dictionary so that the word dictionary contains only "Zhao Xiaomingming" and the stop word dictionary contains "so handsome". After configuration, call the /_analyze API again. The result is as follows:
           	{
           	   "tokens": [
           	      {
           	         "token": "Zhao Xiaomingming",
           	         "start_offset": 0,
           	         "end_offset": 4,
           	         "type": "CN_WORD",
           	         "position": 0
           	      }
           	   ]
           	}

          The result shows that "Zhao Xiaomingming" is now segmented as a single word, while "so handsome" is removed as a stop word.
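
          The /_analyze API only tests the analyzer itself. For searches to benefit from the custom dictionary, the index fields must use an IK analyzer. The following is a minimal sketch: the index name "my_index" and the field name "content" are hypothetical, while ik_max_word and ik_smart are the two analyzers provided by the IK plugin:

              PUT /my_index
              {
                  "mappings": {
                      "properties": {
                          "content": {
                              "type": "text",
                              "analyzer": "ik_max_word",
                              "search_analyzer": "ik_smart"
                          }
                      }
                  }
              }

          With this mapping, documents are segmented with ik_max_word at index time and queries with ik_smart, both of which pick up the remote dictionaries configured above.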
