ICU Analysers for Elastic search

Akhil_Suresh · January 19, 2016, 1:07pm

Does Elasticsearch support a Korean language analyzer? I need help with that.

igor_k · January 19, 2016, 1:19pm

When I worked at Egnyte we were able to tokenize Korean using the ICU tokenizer. Please take a look at this blog post: https://www.egnyte.com/blog/2015/07/indexing-multilingual-documents-with-elasticsearch/

In general, ICU will let you tokenize languages where words are not space-delimited (like Korean) and will fold national characters to their ASCII versions (as in French or Polish, é --> e).
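As a rough illustration of the folding part only: the snippet below is not the ICU plugin, just Python's standard unicodedata module, and the function name fold_to_ascii is made up for this example. It decomposes each character and drops the combining accent marks, which is a simplified approximation of what a folding filter does for accented Latin letters.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks.

    This is a simplified stand-in for illustration, not the ICU folding filter.
    """
    # NFD splits "é" into the base letter "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_to_ascii("é"))
print(fold_to_ascii("café"))
```

Letters that have no canonical decomposition are left unchanged by this sketch, so it covers less than a real folding filter would.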

Hope this helps.

Thanks,
Igor

Akhil_Suresh · January 19, 2016, 1:24pm

Thanks @igor_k for the response.

This is how I used the language analyzer. I am not able to query all Korean words; only some of them work. Please help if any modifications are required.

  analysis: {
    char_filter: {
      hyphen_mapping: {
        type: "mapping",
        mappings: [
          "-=>"
        ]
      }
    },
    filter: {
      korean_collation: {
        type: "icu_collation",
        language: "ko",
        country: "KR",
        decomposition: "canonical"
      }
    },
    analyzer: {
      custom_with_char_filter: {
        tokenizer: "standard",
        char_filter: [
          "hyphen_mapping"
        ],
        filter: ["standard", "lowercase", "stop", "porter_stem"]
      },
      korean: {
        tokenizer: "icu_tokenizer",
        char_filter: [
          "hyphen_mapping"
        ],
      filter: ["icu_normalizer", "lowercase", "stop", "porter_stem", "korean_collation"]
      }

    }
  }
},
mappings: {
  document: {
    properties: {

igor_k · January 21, 2016, 8:31am

Hi, I have never tried to stem Korean words. I think the issue is in your filter pipeline. You have porter_stem, but its web page suggests it is an English-only stemmer:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English.

Try removing it. Also, you can start simple, with icu_tokenizer and icu_folding, and see where that leads you. For example, if you use folding you do not need the lowercase filter.

You can start with this example https://www.found.no/play/gist/81780a22b33efa60f439 and try your Korean searches there (I do not know Korean, so it is hard for me to give more than general tips). Then you can build it up if you need fancier features.
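A minimal sketch of that "start simple" setup, written as a Python dict that serializes to the index settings JSON. The analyzer name korean_simple is made up for this example, and the sketch assumes the ICU analysis plugin is installed so that icu_tokenizer and icu_folding are available; it has not been tested against Korean data.

```python
import json

# Index settings with one custom analyzer: ICU tokenizer plus ICU folding.
# "korean_simple" is a hypothetical name chosen for this example.
settings = {
    "analysis": {
        "analyzer": {
            "korean_simple": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                # No porter_stem (English-only) and no separate lowercase
                # filter, following the suggestion above.
                "filter": ["icu_folding"],
            }
        }
    }
}

# Print the JSON body that would be sent when creating the index.
print(json.dumps({"settings": settings}, indent=2))
```

From there, filters can be added back one at a time, checking the Korean queries after each change.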

Hope this helps,
Igor

Akhil_Suresh · February 22, 2016, 6:36am

Thanks @igor_k. Partial text search for Korean text is not working. For example, if we search for "에프알엘코리아" we get 100 results, but if we search for "에프알" I do not get any results. This text belongs to a field named "sections". Do I need to add a particular analyzer for this field to enable partial text search? Please help.
