tl;dr: asking for input on the "right" way to support lots of languages
The time has come for me to support more languages! I have some ideas
about how to do this but I'd love some advice. Background: my install base
has ~300 languages and my searching supports the concept of both "plain"
and "aggressive" analyzers. Each index only supports a single language.
For languages for which I don't have a special case I'll make a single
analyzer with:
"filter": [ "standard", "icu_normalizer", "lowercase" ]
For languages that have a stemmer I'll use the analyzer above as a
"plain" analyzer and something like this (custom per language) for the
"aggressive" analyzer:
"filter": [ "standard", "icu_normalizer", "possessive_english",
"lowercase", "stop", "kstem", "asciifolding" ]
Not all languages will want asciifolding (but we're used to it in English)
and not all languages have stop words. Bonus: Some of my install base uses
word_delimiter in the aggressive analyzer as well!
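
To make the per-language switches concrete, here is a rough sketch of how the two filter chains could be generated. Everything beyond the filter names quoted above is an assumption for illustration: the helper names (`aggressive_filters`, `analysis_settings`), the analyzer names "plain" and "aggressive", the "standard" tokenizer default, the position of word_delimiter in the chain, and the idea that "possessive_english" is registered as a custom stemmer filter under that name.

```python
import json

# Filter chain for the "plain" analyzer, as quoted in the post.
PLAIN_FILTERS = ["standard", "icu_normalizer", "lowercase"]


def aggressive_filters(stemmer, possessive=None, stop=True,
                       ascii_fold=True, word_delimiter=False):
    """Return the aggressive filter chain for one language."""
    chain = ["standard", "icu_normalizer"]
    if word_delimiter:
        # Assumption: placed before lowercasing so case-change splits still work.
        chain.append("word_delimiter")
    if possessive:
        chain.append(possessive)
    chain.append("lowercase")
    if stop:
        chain.append("stop")
    chain.append(stemmer)
    if ascii_fold:
        chain.append("asciifolding")
    return chain


def analysis_settings(stemmer, tokenizer="standard", **options):
    """Build the "analysis" settings for a single-language index."""
    settings = {
        "analysis": {
            "analyzer": {
                "plain": {
                    "type": "custom",
                    "tokenizer": tokenizer,
                    "filter": list(PLAIN_FILTERS),
                },
                "aggressive": {
                    "type": "custom",
                    "tokenizer": tokenizer,
                    "filter": aggressive_filters(stemmer, **options),
                },
            }
        }
    }
    possessive = options.get("possessive")
    if possessive:
        # Assumption: the possessive filter is a custom "stemmer" filter
        # registered under the same name as its language.
        settings["analysis"]["filter"] = {
            possessive: {"type": "stemmer", "language": possessive}
        }
    return settings


if __name__ == "__main__":
    english = analysis_settings("kstem", possessive="possessive_english")
    print(json.dumps(english, indent=2))
```

A language without stop words or folding would call something like `analysis_settings(stemmer_name, stop=False, ascii_fold=False)`. Note that an unconfigured "stop" filter uses English stop words, so other languages would need their own stop filter definition rather than the bare name.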
For some languages I think I'll need to replace the plain analyzer with a
weakened version of the custom analyzer - Japanese will need the kuromoji
tokenizer, for example.
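
For the Japanese case, a rough sketch of what "weakened" might mean: both analyzers share the kuromoji tokenizer, and the plain one simply drops the morphological filters. The tokenizer and filter names come from the Kuromoji analysis plugin; the function name, the analyzer names, and the particular choice and order of filters are assumptions, not a tested recommendation.

```python
import json


def japanese_analysis_settings():
    """Sketch of plain/aggressive analyzers sharing the kuromoji tokenizer."""
    tokenizer = "kuromoji_tokenizer"
    # Plain: segmentation plus normalization only.
    plain = ["icu_normalizer", "lowercase"]
    # Aggressive: add base-form reduction, part-of-speech pruning and
    # the kuromoji stemmer (assumed selection and order).
    aggressive = [
        "kuromoji_baseform",
        "kuromoji_part_of_speech",
        "icu_normalizer",
        "lowercase",
        "kuromoji_stemmer",
    ]
    return {
        "analysis": {
            "analyzer": {
                "plain": {"type": "custom", "tokenizer": tokenizer, "filter": plain},
                "aggressive": {"type": "custom", "tokenizer": tokenizer, "filter": aggressive},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(japanese_analysis_settings(), indent=2))
```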
Does this make sense?
Should I spend time investigating using the icu_tokenizer instead of the
standard tokenizer?
Are there any analysis plugins that I should look at beyond ICU, Smart-CN,
Stempel, and Kuromoji?
I saw some talk about a Hebrew plugin but that plugin isn't listed on
Elasticsearch's plugins page. Is it useful/ready?
I'm happy to use any plugin so long as it has some open source license, is
actively supported by someone who speaks the language, and has instructions
in English. I assume plugins always have instructions in their native
language, but I need some in English too.
Thanks for reading! Please tell me all the mistakes I'm about to make!