ICU Analysers for Elasticsearch


(Akhil Suresh) #1

Does Elasticsearch support a Korean language analyzer? I need help with that.


(Igor Kupczyński) #2

Hi @Akhil_Suresh,

When I worked at Egnyte we were able to tokenize Korean using the ICU tokenizer. Please take a look at this blog post: https://www.egnyte.com/blog/2015/07/indexing-multilingual-documents-with-elasticsearch/

In general, ICU will let you tokenize languages where words are not space-delimited (like Korean) and will fold national characters to their ASCII versions (as in French or Polish, é --> e).
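
To make the folding idea concrete, here is a rough sketch in plain Python. It is not ICU and not what Elasticsearch runs internally; it only approximates accent folding by decomposing characters and dropping combining marks. The function name fold_to_ascii is made up for illustration, and the real icu_folding filter applies a broader set of foldings (including case folding).

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate accent folding (hypothetical helper, not the ICU filter).

    Decomposes each character (NFKD) and drops the combining marks,
    so "é" becomes "e". Letters that have no decomposition are left as-is.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_to_ascii("é"))     # prints "e"
print(fold_to_ascii("café"))  # prints "cafe"
```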

Hope this helps.

Thanks,
Igor


(Akhil Suresh) #3

Thanks @igor_k for the response.

This is how I used the language analyzer. I am not able to match all Korean words in my queries; only some of them work. Please let me know if any modifications are required.

  analysis: {
    char_filter: {
      hyphen_mapping: {
        type: "mapping",
        mappings: [
          "-=>"
        ]
      }
    },
    filter: {
      korean_collation: {
        type: "icu_collation",
        language: "ko",
        country: "KR",
        decomposition: "canonical"
      }
    },
    analyzer: {
      custom_with_char_filter: {
        tokenizer: "standard",
        char_filter: [
          "hyphen_mapping"
        ],
        filter: ["standard", "lowercase", "stop", "porter_stem"]
      },
      korean: {
        tokenizer: "icu_tokenizer",
        char_filter: [
          "hyphen_mapping"
        ],
        filter: ["icu_normalizer", "lowercase", "stop", "porter_stem", "korean_collation"]
      }

    }
  }
},
mappings: {
  document: {
    properties: {

(Igor Kupczyński) #4

Hi, I have never tried to stem Korean words. I think the issue is in your filter pipeline. You have porter_stem, but its web page says it is an English-only stemmer:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English.

Try removing it. Also, you can start simple, with icu_tokenizer and icu_folding, and see where that leads you. For example, if you use folding you do not need the lowercase filter.

You can start with this example https://www.found.no/play/gist/81780a22b33efa60f439 and try your Korean searches there (I do not know Korean, so it is hard for me to give more than general tips). Then you can build it up if you need fancier features.
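
As a minimal sketch of the "start simple" suggestion, the index settings could look roughly like the Python dict below, serialized to JSON for the index-creation request. The analyzer name korean_simple and the variable name index_body are made up for illustration; icu_tokenizer and icu_folding are assumed to be available through the ICU analysis plugin. This is a starting point under those assumptions, not a tuned Korean setup.

```python
import json

# Hypothetical minimal settings: ICU tokenizer plus ICU folding only.
# No porter_stem (English-only) and no separate lowercase filter,
# since folding already takes care of case.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "korean_simple": {  # illustrative name, not from the thread
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding"],
                }
            }
        }
    }
}

# JSON body to send when creating the index.
print(json.dumps(index_body, ensure_ascii=False, indent=2))
```

Further filters can then be added one at a time, checking after each change that the Korean queries still match.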

Hope this helps,
Igor


(Akhil Suresh) #5

Thanks @igor_k. Partial text search for Korean text is not working. For example, if we search for "에프알엘코리아" we get 100 results, but if we search for "에프알" I do not get any results. This text belongs to a field named "sections". Do I need to add a particular analyzer for this field to enable partial text search? Please help.

