Supporting as many languages as possible


(Nik Everett) #1

tl;dr: asking for input on the "right" way to support lots of languages

The time has come for me to support more languages! I have some ideas
about how to do this but I'd love some advice. Background: my install base
has ~300 languages and my searching supports the concept of both "plain"
and "aggressive" analyzers. Each index only supports a single language.
My thoughts:

For languages for which I don't have a special case I'll make a single
analyzer:
{
  "type": "custom",
  "tokenizer": "standard",
  "filter": [ "standard", "icu_normalizer", "lowercase" ]
}
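
To show where that fallback analyzer would sit in an index's settings, here is a minimal Python sketch. The helper name index_body and the analyzer name "plain" are illustrative choices of mine, not anything Elasticsearch requires, and it assumes the ICU plugin is installed so that icu_normalizer resolves.

```python
import json

# The fallback analyzer from the post, unchanged.
PLAIN = {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["standard", "icu_normalizer", "lowercase"],
}


def index_body(analyzers):
    """Wrap named analyzers in the settings body used when creating an index.

    `index_body` is a hypothetical helper; only the nesting
    settings -> analysis -> analyzer comes from Elasticsearch itself.
    """
    return {"settings": {"analysis": {"analyzer": analyzers}}}


body = index_body({"plain": PLAIN})
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the request body when creating the index for a language with no special handling.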

For languages that have a stemmer I'll use the analyzer above as the
"plain" analyzer and something like this (customized per language) as the
"aggressive" analyzer:
{
  "type": "custom",
  "tokenizer": "standard",
  "filter": [ "standard", "icu_normalizer", "possessive_english",
              "lowercase", "stop", "kstem", "asciifolding" ]
}

Not all languages will want asciifolding (but we're used to it in English)
and not all languages have stop words. Bonus: Some of my install base uses
word_delimiter in the aggressive analyzer as well!
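
Since the aggressive chain differs per language mostly in which optional filters are switched on, it could be generated rather than hand-written. The sketch below is one way to do that in Python; the function name, its parameters, and the position of word_delimiter in the chain are all my assumptions. As far as I know, possessive_english is a language option of the stemmer filter rather than a built-in filter name, so the sketch also defines it as a custom filter.

```python
import json

# Assumed definition: a custom filter named "possessive_english" built on the
# stemmer filter's "possessive_english" language option.
CUSTOM_FILTERS = {
    "possessive_english": {"type": "stemmer", "language": "possessive_english"},
}


def aggressive_analyzer(stemmer, possessive=None, stop_filter="stop",
                        asciifolding=True, word_delimiter=False):
    """Build an "aggressive" analyzer for one language (hypothetical helper).

    Pass stop_filter=None for languages without stop words and
    asciifolding=False for languages where folding is unwanted.
    """
    filters = ["standard"]
    if word_delimiter:
        # Placement is a guess; the post does not say where this filter goes.
        filters.append("word_delimiter")
    filters.append("icu_normalizer")
    if possessive:
        filters.append(possessive)
    filters.append("lowercase")
    if stop_filter:
        filters.append(stop_filter)
    filters.append(stemmer)
    if asciifolding:
        filters.append("asciifolding")
    return {"type": "custom", "tokenizer": "standard", "filter": filters}


english = aggressive_analyzer("kstem", possessive="possessive_english")
analysis = {"filter": CUSTOM_FILTERS, "analyzer": {"aggressive": english}}
print(json.dumps(analysis, indent=2))
```

With the defaults shown, the English call reproduces the filter list above; a language without stop words or folding would call it with stop_filter=None and asciifolding=False.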

For some languages I think I'll need to replace the plain analyzer with a
weakened version of the custom analyzer - Japanese will need the kuromoji
tokenizer, for example.
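
One way to picture the per-language override is a small lookup table that swaps only the tokenizer and leaves the rest of the plain analyzer alone. This is a rough Python sketch: the table, the function name, and the "ja" language code are assumptions for illustration, kuromoji_tokenizer is the tokenizer provided by the Kuromoji plugin, and keeping the same filters for Japanese is a guess rather than a recommendation.

```python
# Hypothetical table: language code -> tokenizer that replaces "standard".
PLAIN_TOKENIZER_OVERRIDES = {
    "ja": "kuromoji_tokenizer",  # requires the Kuromoji analysis plugin
}


def plain_analyzer(lang):
    """Return the plain analyzer for a language, swapping only the tokenizer."""
    tokenizer = PLAIN_TOKENIZER_OVERRIDES.get(lang, "standard")
    return {
        "type": "custom",
        "tokenizer": tokenizer,
        # Assumption: the filter chain stays the same for every language.
        "filter": ["standard", "icu_normalizer", "lowercase"],
    }


print(plain_analyzer("ja"))
print(plain_analyzer("fr"))
```

Languages missing from the table fall back to the standard tokenizer, so only the special cases need entries.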

Questions:
1. Does this make sense?
2. Should I spend time investigating using the icu_tokenizer instead of the
   standard tokenizer?
3. Are there any analysis plugins that I should look at beyond ICU, Smart-CN,
   Stempel, and Kuromoji?
4. I saw some talk about a Hebrew plugin but that plugin isn't listed on
   Elasticsearch's plugins page. Is it useful/ready?

Assertion:
I'm happy to use any plugin so long as it has some open source license, is
actively supported by someone who speaks the language, and has instructions
in English. I assume plugins always have instructions in their native
language, but I need some in English too.

Thanks for reading! Please tell me all the mistakes I'm about to make!

Nik



(system) #2