Does Elasticsearch support a Korean language analyzer? I need help with that.
Hi @Akhil_Suresh,
When I worked at Egnyte we were able to tokenize Korean using the ICU Tokenizer. Please take a look at this blog post: https://www.egnyte.com/blog/2015/07/indexing-multilingual-documents-with-elasticsearch/
In general, ICU will let you tokenize languages where words are not space-delimited (like Korean) and will fold national characters to their ASCII versions (as in French or Polish, é --> e).
Hope this helps.
Thanks,
Igor
Thanks @igor_k for the response.
This is how I used the language analyzer. I am not able to query all Korean words; only some of them work. Please let me know if any modifications are required.
  analysis: {
    char_filter: {
      hyphen_mapping: {
        type: "mapping",
        mappings: [
          "-=>"
        ]
      }
    },
    filter: {
      korean_collation: {
        type: "icu_collation",
        language: "ko",
        country: "KR",
        decomposition: "canonical"
      }
    },
    analyzer: {
      custom_with_char_filter: {
        tokenizer: "standard",
        char_filter: [
          "hyphen_mapping"
        ],
        filter: ["standard", "lowercase", "stop", "porter_stem"]
      },
      korean: {
        tokenizer: "icu_tokenizer",
        char_filter: [
          "hyphen_mapping"
        ],
        filter: ["icu_normalizer", "lowercase", "stop", "porter_stem", "korean_collation"]
      }
    }
  }
},
mappings: {
  document: {
    properties: {
Hi, I have never tried to stem Korean words. I think the issue is in your filter pipeline. You have porter_stem, but its web page suggests it is an English-only stemmer:
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English.
Try removing it. Also, you can start simple, with icu_tokenizer and icu_folding, and see where that leads you. For example, if you use folding you do not need the lowercase filter.
You can start with this example https://www.found.no/play/gist/81780a22b33efa60f439 and try your Korean searches there (I do not know Korean, so it is hard for me to give more than general tips). Then you can build it up if you need fancier features.
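As a rough sketch of what "start simple" could look like, here is a small Python snippet that builds the index settings body with only icu_tokenizer and icu_folding. The analyzer name korean_simple is made up for illustration, and this assumes the analysis-icu plugin is installed; Python is used here only to assemble and print the JSON, not to talk to the cluster.

```python
import json

# Minimal analysis settings: ICU tokenizer plus ICU folding, nothing else.
# "korean_simple" is a hypothetical analyzer name chosen for this sketch.
# icu_tokenizer and icu_folding come from the analysis-icu plugin (assumed installed).
settings = {
    "analysis": {
        "analyzer": {
            "korean_simple": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                # No "lowercase" filter: icu_folding already folds case.
                # No "porter_stem": that stemmer targets English.
                "filter": ["icu_folding"],
            }
        }
    }
}


def build_index_body(analysis_settings):
    """Serialize the settings into a JSON body for index creation."""
    return json.dumps({"settings": analysis_settings}, ensure_ascii=False, indent=2)


print(build_index_body(settings))
```

If this simple analyzer finds the words that the longer pipeline missed, filters can be added back one at a time to see which one breaks the Korean terms.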
Hope this helps,
Igor
Thanks @igor_k. Partial text search for Korean text is not working. For example, if we search for "에프알엘코리아" we get 100 results, but if we search for "에프알" I do not get any results. This text belongs to a field named "sections". Do I need to add a particular analyzer for this field to enable partial text search? Please help.
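One possible explanation, offered as an assumption rather than something confirmed in this thread: if the tokenizer emits "에프알엘코리아" as a single token, a term lookup for "에프알" finds nothing because the two terms are not equal. A common way to support prefix matches is to index edge n-grams of each token (Elasticsearch has an edge_ngram token filter for this). The plain Python sketch below only illustrates the idea; the function name and the min_gram/max_gram defaults are hypothetical and it does not reproduce Elasticsearch's own analysis.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


indexed_token = "에프알엘코리아"  # assumed to be indexed as one token
query = "에프알"

# Exact term lookup: the query term differs from the indexed token, so no hit.
print(query == indexed_token)               # False

# With edge n-grams indexed, the prefix itself is a stored term.
print(query in edge_ngrams(indexed_token))  # True
```

In a real index, the n-gram filter would normally be applied at index time only, with a plain analyzer at search time, so that the query text is not itself split into prefixes.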