Multi-lingual ES

A requirement has come in to internationalize our ES index. What will I need
to do to make sure the ES index supports non-English languages? Here are some
things I thought of; are there any more?

Text analysis to make sure that if a user searches for a word without
diacritics, it is still found.
Support sorting on non-English alphabets.
Stemming, suggestions and spellcheck on non-English words.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

making something like your req. work well can be tricky. Do you have
some more info on how your app works? For instance, do you know what
language the user is searching in, or is your content translated
into different languages?

On Saturday, August 3, 2013 10:02:29 AM UTC+2, dan wrote:

A requirement has come in to internationalize our ES index. What will I
need to do to make sure the ES index supports non-English languages? Here are
some things I thought of; are there any more?

Text analysis to make sure that if a user searches for a word without
diacritics, it is still found.

I think it is needed anyway, so I'd say this is a good one :)

Support sorting on non-English alphabets.

You mean depending on the user's locale or something like this? One option
is ICU collation based sorting, unless you wanna rely on Unicode sort
order.

Stemming, suggestions and spellcheck on non-English words.

I'd recommend evaluating stemming case by case; I am not sure you really
wanna apply this generally. Can you talk a little about the languages you
wanna search on as well? In general the ICU analysis capabilities might
help you with your requirements:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin/

simon


Hi Simon.

Thanks for your reply.

We do not know what languages will be in the index. The documents are
pushed into the index from an ingestion pipeline, but the locale will be
indexed with each document. I was thinking we would pass a locale parameter
with the request and filter the results based on the locale value.
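
A minimal sketch of that idea, written as a Python helper that builds the request body. The field names `title` and `locale` are assumptions for illustration, and the `bool`/`filter` form shown is the query DSL of current Elasticsearch releases rather than the syntax available when this thread was written.

```python
import json


def build_locale_query(text, locale, field="title"):
    """Build a search body that matches `text` and keeps only hits in `locale`.

    `field` and the "locale" field name are hypothetical; adjust to the mapping.
    """
    return {
        "query": {
            "bool": {
                # Scored full-text match on the searched field.
                "must": {"match": {field: text}},
                # Non-scoring exact filter on the locale stored with each document.
                "filter": {"term": {"locale": locale}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_locale_query("osa", "fi_FI"), indent=2))
```

The term filter assumes the locale is indexed as an exact, unanalyzed value, so "fi_FI" in the request must match the stored string character for character.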

But that leaves stemming, sorting and spell check. I think this will have
to be on a case by case basis as you say.

I will check out the links you posted.

Thanks.


We are doing something similar, taking advantage of the document's
language (we detect it ourselves) and the _analyzer field at index time to
select an indexing analyzer.

The query is then also assigned a language, which selects a querying
analyzer. That analyzer uses the same stemming algorithm as the matching
indexing analyzer, plus some extra optimizations.
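
As a rough sketch of the query side only: the lookup table below is hypothetical, and it uses Elasticsearch's stock language analyzers ("english", "finnish", "german") as stand-ins, since the custom analyzers described above are not shown in this thread. The field name `body` in the usage note is also an assumption.

```python
# Hypothetical mapping from a detected language code to an analyzer name.
# The values are built-in Elasticsearch language analyzers, used here only
# as placeholders for whatever custom analyzers a real setup defines.
ANALYZER_BY_LANGUAGE = {
    "en": "english",
    "fi": "finnish",
    "de": "german",
}


def analyzer_for(language, default="standard"):
    """Return the analyzer name for a language code, or a fallback."""
    return ANALYZER_BY_LANGUAGE.get(language.lower(), default)


def build_match_query(field, text, language):
    """Build a match query whose text is analyzed with the language's analyzer."""
    return {
        "query": {
            "match": {
                field: {"query": text, "analyzer": analyzer_for(language)}
            }
        }
    }
```

For example, `build_match_query("body", "talot", "fi")` asks for the query text to be analyzed with the Finnish analyzer, while an unknown language code falls back to "standard".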

This is only part of the solution; I hope to be blogging about the entire
solution soon.


This is a very wide and deep subject. But in general, ES has all the
facilities to do this just fine (and vastly better than most), and the ICU
isn't needed (though it can help if desired, I would guess).

ES does a nice job of sorting by rank; I actually built a post-sorting that
uses the Java collation for String fields, and implements an SQL-like ORDER
BY syntax. Works fine, and gives me more control

Be aware of your users. We English-speaking folks think of a letter as
being the same with or without an accent. But in other languages, these are
very different letters. For example, would an English-speaking person refer
to T as "I with a bar" and expect them to match each other? No. As
another example, in Finnish the Å character is expected to match O and not
A, but they expect Å to sort after Z. If you match Å with A and sort it
with A, the Finns will reject your solution and choose a real partner to
work with them. (They were superb customers, and I learned a great deal
about cultural matching during queries, sorting responses, and geospatial
behaviors from them.)

For my test purposes, I set up a small index with English, Finnish, and
Chinese. I set up table-based synonyms (works fine) so that, for example,
"doctor" matches the phrase "醫生". I also experimented with multi-mapping so
that, for example, the cn field is the common name, but cn.fi is the common
name with Finnish analysis rules. There's only one copy of the cn field in
the _source; ES just indexes it in different ways. So I can query cn:osa
and I won't find Åsa, but if I query cn.fi:osa I will find her.
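
The effect can be illustrated with a small pure-Python emulation. This is not Elasticsearch code: the two `analyze_*` functions are hypothetical stand-ins for a default analyzer and a Finnish one that maps Å to O before lowercasing, and the "index" is just a set of terms per field.

```python
# Character equivalence applied only by the Finnish analysis (Å matches O).
FINNISH_CHAR_MAP = str.maketrans({"Å": "O", "å": "o"})


def analyze_default(text):
    """Stand-in for a default analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def analyze_finnish(text):
    """Stand-in for Finnish rules: map Å to O, then lowercase and split."""
    return text.translate(FINNISH_CHAR_MAP).lower().split()


def index_document(cn):
    """Index one source value two ways, like cn and cn.fi on a multi-field."""
    return {
        "cn": set(analyze_default(cn)),
        "cn.fi": set(analyze_finnish(cn)),
    }


def matches(doc_index, field, query):
    """True if every analyzed query term is present in the given field."""
    analyze = analyze_finnish if field == "cn.fi" else analyze_default
    return all(term in doc_index[field] for term in analyze(query))


doc = index_document("Åsa")
print(matches(doc, "cn", "osa"))     # False: cn holds the term "åsa"
print(matches(doc, "cn.fi", "osa"))  # True: cn.fi holds the term "osa"
```

The source value is stored once; only the indexed terms differ per field, which is why the same query text behaves differently on cn and cn.fi.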

Brian


You should know beforehand what locales you use, otherwise you will run
into trouble, because adding locales on the fly takes much consideration.

Besides stemming, look into ICU folding. This normalizes many characters,
independent of language.

For sorting, here is a short ICU collation demo:
Norwegian Bokmål sort with Elasticsearch · GitHub

For spell check, I recommend hunspell spell check. This kind of spell check
is only available outside ES at the moment. I once started a dictionary
effort for dictionary-based spell checking, when hunspell in Lucene was
quite broken, and I will pick it up again when ES 1.0.0 is in sight.

Jörg


Jörg,

You should know beforehand what locales you use, otherwise you will run
into trouble, because adding locales on the fly takes much consideration.

Of course! My customers always told me the locale(s) of the data feed. Very
important to know up-front, as you state.

Besides stemming, look into ICU folding. This normalizes many characters,
independent of language.

In my C++ engine, I used the ICU's collation for response sorting, and for
most of the search indexing. But for searching, I often had to add
character equivalencies (for Finnish: Å = O, and W = V)... ES does this
with character mapping (it's like they read my mind!). And for sorting, Å
sorted after Z and W sorted after V as it should, because of the fi_FI
locale used when generating the ICU collation key. (Well, after enough
versions went by to get the V,W collation order correct).
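
To make the sorting half concrete, here is a hand-rolled Python sketch. It is only a stand-in for a real locale-based collation key (ICU or Java collation handle far more, including several comparison strengths); the alphabet string and the function name are assumptions for illustration.

```python
# Finnish alphabet order: å, ä and ö come after z, and w follows v.
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
_RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}


def finnish_sort_key(text):
    """Return a list of per-character ranks usable as a sort key.

    Characters outside the alphabet sort after all letters, by code point.
    """
    return [
        _RANK.get(ch, len(FINNISH_ALPHABET) + ord(ch))
        for ch in text.lower()
    ]


names = ["Östen", "Åsa", "Zacharias", "Wilma", "Anna", "Ärla", "Ville"]
print(sorted(names, key=finnish_sort_key))
```

Note that this key is used only for ordering responses. The Å = O equivalence belongs to matching, so the two concerns stay separate, as described above.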

For sorting, here is a short ICU collation demo:
Norwegian Bokmål sort with Elasticsearch · GitHub

This is very good. Thanks!

But one question: It seems (to my untrained eye) that it implies that the
analyzed terms are used to sort as well as to match during a query. Is this
correct, or did I miss something?

For Finnish, the matching rules and the sorting rules are very different
for the cases above. But for all of the other languages I supported, it was
acceptable that the same collation key could be used for both matching
during a query and for sorting the responses.

In ES, I emulate this for Finnish by setting up the character mapping for
those characters, and then using just the locale-based Java collation key
for my own post-query response sorting. For now, anyway.

Brian


It's just sorting. The icu_collation analyzer creates binary sort keys.

For search, ICU folding should work, like in this short demonstration:
ICU folding tokenizer filter in action · GitHub

ICU folding works on Unicode code points, so many languages are
covered quite well, unless you need additional special rules (like in
German, where umlaut expansion is common: Köln -> Koln and Köln -> Koeln,
for example, are equivalent forms of Köln).
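
A rough approximation in plain Python, using only the standard `unicodedata` module. This is not the ICU folding filter itself, which covers more cases; the function names and the expansion table are assumptions for illustration.

```python
import unicodedata


def fold(text):
    """Approximate folding: decompose, drop combining marks, case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold()


# Language-specific rule that generic folding does not provide.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def german_variants(text):
    """Return both the folded form and the umlaut-expanded form of a word."""
    # Compose first so that each umlaut is a single character we can look up.
    lowered = unicodedata.normalize("NFC", text).casefold()
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in lowered)
    return {fold(lowered), fold(expanded)}


print(fold("Köln"))
print(sorted(german_variants("Köln")))
```

Generic folding alone yields only the "koln" form; the expansion table is the kind of additional special rule needed for "koeln" to match as well.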

Jörg


Thanks, Jörg

In the past, I just used ICU collation keys of primary strength for both
searching and sorting. But for Finnish (in particular), I first performed
character equivalency mapping for the search side, but not for the sorting
side. Worked without complaint from Finland to Belgium to Greece to
Australia. The information at ICU User Guide | ICU Documentation brings back
memories of the earlier ICU versions I used (2.4 and perhaps 2.6, but no
higher... this was over 10 years ago).

And since the Finns, who were the most demanding customers (and the
customers I worked closest with... that was an awesome experience!), were
very pleased with the end result, I didn't go further. That includes not
investigating ICU folding vs. my collation-based strategy.

Will look into your gist in more detail, once I figure out how to package
and install ES + ICU without needing an internet connection on each
machine. :)

Brian
