What is the best approach for Chinese/Japanese language indexing and searching?

shaunak · May 1, 2015, 12:04am

Chinese and Japanese are hard to get right, but below I've included what you need to get the basics working. You will need to install the kuromoji plugin, the smartcn plugin and the icu plugin.

For search, you should use the smartcn analyzer for Chinese and the kuromoji analyzer for Japanese. For aggregations, if you want to use the terms aggregation, you just need to set the field you want to aggregate on to not_analyzed. That way, it'll use the whole value of that field as the term.
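
As a rough sketch of how the two fit together (assuming the `search_text` and `aggs_text` fields from the mappings in this post, and a made-up aggregation name `top_values`), a request body like this, sent to `POST /chinese/_search`, runs an analyzed match query and buckets the hits by the whole, unanalyzed field value:

```json
{
  "query": {
    "match": { "search_text": "代金券" }
  },
  "aggs": {
    "top_values": {
      "terms": { "field": "aggs_text" }
    }
  }
}
```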

The typeahead search is where things get trickier. You should use the completion suggester for both of them, with preserve_separators set to false and without fuzziness. These suggesters will need a custom analyzer for each language.

For Chinese, you need this:

PUT /chinese
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin": {
          "type": "icu_transform",
          "id": "Han-Latin"
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "keyword",
          "filter": [
            "pinyin",
            "lowercase",
            "cjk_width"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "search_text": {
          "type": "string",
          "analyzer": "smartcn"
        },
        "aggs_text": {
          "type": "string",
          "index": "not_analyzed"
        },
        "suggest_text": {
          "type": "completion",
          "index_analyzer": "autocomplete",
          "search_analyzer": "autocomplete",
          "preserve_separators": false
        }
      }
    }
  }
}

And for Japanese, this:

PUT /japanese
{
  "settings": {
    "analysis": {
      "filter": {
        "romaji": {
          "type": "kuromoji_readingform",
          "use_romaji": true
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "lowercase",
            "cjk_width",
            "romaji"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "search_text": {
          "type": "string",
          "analyzer": "kuromoji"
        },
        "aggs_text": {
          "type": "string",
          "index": "not_analyzed"
        },
        "suggest_text": {
          "type": "completion",
          "index_analyzer": "autocomplete",
          "search_analyzer": "autocomplete",
          "preserve_separators": false
        }
      }
    }
  }
}
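
Once documents are indexed, the suggester can be queried through the `_suggest` endpoint. A minimal sketch, assuming the `suggest_text` field from the mappings above and an arbitrary suggestion name `typeahead`: send this body to `POST /chinese/_suggest` (or `/japanese/_suggest`) with whatever the user has typed so far as `text`.

```json
{
  "typeahead": {
    "text": "代金",
    "completion": {
      "field": "suggest_text"
    }
  }
}
```

No `fuzzy` block is included, in line with the advice above to run these suggesters without fuzziness.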

Another issue that you may not have encountered yet is making autocomplete work in the browser. The problem is that Chinese users, for example, need to type several keystrokes to produce a single character, but the browser will only fire the keypress event once the whole character has been entered. Really you want to intercept the keystrokes earlier in the process.

This Stack Overflow question may help point you in the right direction: http://stackoverflow.com/questions/7316886/detecting-ime-input-before-enter-pressed-in-javascript

Morriaty · May 5, 2016, 3:06am

Hi, @shaunak!

This approach helped me a lot with Chinese search suggestions. Thank you very much. But I still ran into a problem.

For example, I have some docs containing the token 代金券, which means coupon and is spelled dai jin quan in pinyin.

Then I used _suggest to search for a misspelled token, 代经券, which is spelled dai jing quan in pinyin.

The query returned no hits. So what's the problem?

Thank you for your help.

Morriaty · May 5, 2016, 6:57am

I get it now. The completion suggester does not do spelling correction.

So how can I do Chinese search suggestions with the term/phrase suggester?

medcl.net · May 6, 2016, 11:56am

Hi @Morriaty,
As far as your problem goes, there is no difference between Chinese and English.
You should configure fuzziness and min_length so that the suggester returns 代金券 for the input 代经券.
Check out this reference:
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-suggesters-completion.html#fuzzy
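
For illustration only, a fuzzy completion request could look like the body below, sent to `POST /chinese/_suggest`. It assumes the `suggest_text` field and `autocomplete` analyzer from shaunak's mapping earlier in the thread; the suggestion name `fuzzy_typeahead` and the values chosen are placeholders, not recommended settings.

```json
{
  "fuzzy_typeahead": {
    "text": "代经券",
    "completion": {
      "field": "suggest_text",
      "fuzzy": {
        "fuzziness": 1,
        "min_length": 3
      }
    }
  }
}
```

Since the input is analyzed to pinyin before matching, jing and jin differ by a single edit, so a fuzziness of 1 should in principle be enough here.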

Morriaty · May 9, 2016, 2:14am

@medcl1 thank you for your reply!

I have tried the fuzzy query, but I don't understand what the parameter min_length means.

min_length Minimum length of the input before fuzzy suggestions are returned, defaults 3

For example, I searched with the input 大, whose character length is 1 and whose pinyin length is 2, with min_length set to 3.
I expected no hits because the input is shorter than min_length. However, I got the right results.

{
  "suggest": {
    "text": "大",
    "full": {
      "completion": {
        "field": "name.full_pinyin",
        "fuzzy": {
          "fuzziness": 2,
          "min_length": 3
        }
      }
    }
  },
  "size": 0
}


"suggest": {
    "full": [
      {
        "text": "大",
        "offset": 0,
        "length": 1,
        "options": [
          {
            "text": "带开关插座",
            "score": 4
          },
          {
            "text": "大自然",
            "score": 2
          },
          {
            "text": "大自然地板",
            "score": 2
          },
          {
            "text": "代金券",
            "score": 1
          }
        ]
      }
    ]
  }

Morriaty · May 9, 2016, 2:35am

I understand now: 大 was matched as an exact query, not a fuzzy query.

Thank you all the same, @medcl1!
