Jörg,
*You should know beforehand what locales you use, otherwise you will run
into trouble, because adding locales on the fly takes much consideration.*
Of course! My customers always told me the locale(s) of the data feed. Very
important to know up-front, as you state.
*Besides stemming, look into ICU folding. This normalizes many characters,
independent of language.*
In my C++ engine, I used ICU collation for response sorting and for most
of the search indexing. But for searching, I often had to add character
equivalences (for Finnish: Å = O, and W = V)... ES does this with
character mapping (it's like they read my mind!). And for sorting, Å
sorted after Z and W sorted after V, as they should, because of the fi_FI
locale used when generating the ICU collation key (well, after enough
versions went by to get the V, W collation order correct).
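To make the matching side concrete, here is a minimal Java sketch of that kind of character equivalence. The class name `FinnishMatchKey`, the method `fold`, and the exact table are hypothetical illustrations, not the code from my engine and not how ES implements its mapping; it only assumes the two equivalences mentioned above (Å = O, W = V) plus lowercasing.

```java
import java.text.Normalizer;
import java.util.Locale;

public final class FinnishMatchKey {
    // Hypothetical equivalence table: A-with-ring (U+00C5 / U+00E5) matches O,
    // and W matches V. Every other character passes through unchanged.
    private static char equivalent(char c) {
        switch (c) {
            case '\u00C5': return 'O';
            case '\u00E5': return 'o';
            case 'W': return 'V';
            case 'w': return 'v';
            default: return c;
        }
    }

    // Builds a key used only for matching, never for sorting.
    static String fold(String text) {
        // Compose first, so "A" + combining ring is seen as one character.
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length());
        for (int i = 0; i < composed.length(); i++) {
            out.append(equivalent(composed.charAt(i)));
        }
        return out.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("W\u00E5sa")); // prints "vosa"
        System.out.println(fold("Vosa"));      // prints "vosa"
    }
}
```

Both spellings end up with the same match key, while the original strings stay available for locale-aware sorting.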
*For sorting, here is a short ICU collation demo
Norwegian Bokmål sort with Elasticsearch · GitHub*
This is very good. Thanks!
But one question: to my untrained eye, the demo implies that the analyzed
terms are used for sorting as well as for matching during a query. Is this
correct, or did I miss something?
For Finnish, the matching rules and the sorting rules are very different
for the cases above. But for all of the other languages I supported, it was
acceptable that the same collation key could be used for both matching
during a query and for sorting the responses.
In ES, I emulate this for Finnish by setting up the character mapping for
those characters, and then using just the locale-based Java collation key
for my own post-query response sorting. For now, anyway.
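The post-query sorting step can be sketched roughly like this, using only `java.text.Collator` and `CollationKey` from the JDK. The class name `FinnishResponseSort` and the sample values are made up for illustration, and the resulting order depends on the collation rules shipped with the running JDK, so treat it as a sketch rather than a reference implementation.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public final class FinnishResponseSort {
    // Sorts the returned values by locale-based collation key.
    // The keys are computed once per value and then compared directly.
    static List<String> sortByCollationKey(List<String> values, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<CollationKey> keys = new ArrayList<>();
        for (String value : values) {
            keys.add(collator.getCollationKey(value));
        }
        keys.sort(Comparator.naturalOrder()); // CollationKey is Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey key : keys) {
            sorted.add(key.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical response values; "\u00C5bo" starts with A-with-ring.
        List<String> hits = Arrays.asList("Zorro", "\u00C5bo", "Vaasa", "Wasa", "Aalto");
        Locale finnish = Locale.forLanguageTag("fi-FI");
        System.out.println(sortByCollationKey(hits, finnish));
    }
}
```

The match key from the character mapping is never involved here, which is what keeps the Finnish matching rules and sorting rules separate.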
Brian
On Sunday, August 4, 2013 2:05:11 PM UTC-4, Jörg Prante wrote:
You should know beforehand what locales you use, otherwise you will run
into trouble, because adding locales on the fly takes much consideration.
Besides stemming, look into ICU folding. This normalizes many characters,
independent of language.
For sorting, here is a short ICU collation demo
Norwegian Bokmål sort with Elasticsearch · GitHub
For spell check, I recommend hunspell. This kind of spell check is only
available outside ES at the moment. I once started an effort for
dictionary-based spell checking, when hunspell in Lucene was quite broken,
and I will pick it up again when ES 1.0.0 is in sight.
Jörg