Best way to index multiple languages


(Alexandre) #1

Hi,

I am trying to figure out how to index the following in ES.

I have documents representing location names (geonames.org). Each
document has a category such as Airport, restaurant, river, beach
etc ..
I have translated the categories in 25 languages. For each document I
want to add the translated categories.

Should I :

  1. create 1 field per category translation and use a different
    language-specific analyzer for each category field ?
  2. create 1 general category object field with an array of
    translations ... in that case how should I set analysis ?
  3. do it some other way I am not aware of :slight_smile: ?

Bonus question : I also need to do multi-language querying ... so a
French person will query "Aéroport de Genève" but an English person
will query "Geneva Airport" and a Chinese person will query "" in their
own language etc ... Is there something special I need to do to build
the query so that we get the best combination of location name and
category hit?

Many thanks for your answers !


(Jan Fiedler) #2

No real production experience yet but I am using the approach of having one
field per language (with language specific analyzer configurations attached
to them).

For the bonus question: Often you will have some context in your app that
would define the language (e.g. user selecting language for their browsing
session as the remaining page content most likely will have to show the
correct language too). In this case it would be trivial to select the
correct field in the query. Without session context you could use language
detection on the user input and select the correct field based on that. If
you do not have a language detection library, you could try to run the
search across all language fields (this may generate some noise though).
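
A minimal sketch of that field selection in Python, building the request body as a plain dict. The `category_<lang>` field naming and the short language list are assumptions for illustration only; `match` and `multi_match` are Elasticsearch query types, but the exact DSL differs between versions, so treat this as a sketch rather than a drop-in query.

```python
from typing import Optional

# Hypothetical language list; the thread mentions 25 translations.
LANGUAGES = ["en", "fr", "de"]


def category_field(lang: str) -> str:
    """Per-language field name, e.g. 'category_fr' (assumed naming convention)."""
    return f"category_{lang}"


def build_query(text: str, lang: Optional[str] = None) -> dict:
    """Query one language field when the language is known, else all of them."""
    if lang in LANGUAGES:
        # Language known from the session or from detection: single field.
        return {"query": {"match": {category_field(lang): text}}}
    # Language unknown: search every language field (may add noise).
    fields = [category_field(code) for code in LANGUAGES]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}
```

With a session language, `build_query("aéroport", "fr")` targets only `category_fr`; without one, `build_query("airport")` fans out to all the language fields.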


(Otis Gospodnetić) #3

Alexandre,

Check the ML archive, I just asked a similar question the other day
and the approach we'll be taking is the one where we have a single set
of fields and at index-time we explicitly specify the analyzer for
each field depending on the field or document language. We'll be
using our own Language Identifier library for that
(http://sematext.com/products/language-identifier/index.html). You could
use a Language Identifier to detect the query language, too. Precision may
suffer if queries are very short or ambiguous (is "die" an English
verb? Or English noun? Or a German article?), though this can be
addressed through UI, giving people options to select from one or a
few guessed languages, allowing people to permanently store/remember
their language selection and such.
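
To make the ambiguity concrete, here is a toy sketch in Python. It is not the Sematext library and not a real language identifier: the hint-word lists and both function names are invented for illustration. It only shows the control flow of "use the stored preference, else accept an unambiguous guess, else ask the user".

```python
from typing import List, Optional

# Toy hint words per language; deliberately tiny and NOT a real language model.
HINT_WORDS = {
    "en": {"the", "of", "die", "airport"},
    "de": {"der", "die", "das", "flughafen"},
    "fr": {"le", "la", "de", "aéroport"},
}


def guess_languages(query: str) -> List[str]:
    """Return every language tied for the most hint-word matches."""
    tokens = query.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in HINT_WORDS.items()
    }
    best = max(scores.values())
    if best == 0:
        return []
    return sorted(lang for lang, score in scores.items() if score == best)


def resolve_language(query: str, stored_choice: Optional[str] = None) -> Optional[str]:
    """Prefer the user's stored choice; otherwise accept only an unambiguous guess."""
    if stored_choice is not None:
        return stored_choice
    candidates = guess_languages(query)
    # None means "ambiguous or unknown": the UI should offer the candidates.
    return candidates[0] if len(candidates) == 1 else None
```

A query like "die" ties between English and German here, so `resolve_language("die")` returns `None` and the UI would have to offer both options, while a stored preference short-circuits the guess entirely.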

Otis

Sematext is hiring ElasticSearch / Solr developers --



(Shay Banon) #4

It really depends; there are several ways to do it. You can create an index
per language, each with its own mapping that puts a language-specific
analyzer on the relevant field. Another option is to use a separate field
name for each language, each with its own analyzer associated with it. Those
are usually the best two options.
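
A rough sketch of what the two layouts could look like, written as Python dicts. The index names (`places_<lang>`), field names (`category`, `category_<lang>`) and the three-language list are hypothetical. `english`, `french` and `german` are built-in Elasticsearch language analyzers; the `mappings`/`properties` layout and the `text` type follow recent Elasticsearch versions and will look different on older releases.

```python
from typing import Dict

# Language code -> built-in Elasticsearch language analyzer (subset for illustration).
ANALYZERS: Dict[str, str] = {"en": "english", "fr": "french", "de": "german"}


def index_per_language() -> Dict[str, dict]:
    """Option 1: one index per language, same field name, different analyzer."""
    return {
        f"places_{lang}": {
            "mappings": {
                "properties": {"category": {"type": "text", "analyzer": analyzer}}
            }
        }
        for lang, analyzer in ANALYZERS.items()
    }


def field_per_language() -> dict:
    """Option 2: one index, one field per language, each with its own analyzer."""
    properties = {
        f"category_{lang}": {"type": "text", "analyzer": analyzer}
        for lang, analyzer in ANALYZERS.items()
    }
    return {"mappings": {"properties": properties}}
```

In both layouts every field is analyzed by exactly one analyzer; the difference is whether the language is selected through the index name or through the field name at query time.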



(Alexandre) #5

Many thanks to all for your insights !

I started working through the field per language approach. I guess
I'll be using variants of snowball analyzer for each language when
available and a regular LanguageAnalyzer when not.
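
One possible way to express "Snowball where available, something else otherwise" as index settings, sketched in Python. The language tables are small illustrative subsets, and the analyzer and filter names (`fr_text`, `fr_snowball`, ...) are made up. The sketch assumes the `snowball` token filter with a `language` parameter and the built-in `cjk`, `thai` and `standard` analyzers; check the documentation for the Elasticsearch version in use before relying on any of these.

```python
from typing import Dict, List, Tuple

# Language code -> Snowball language name (illustrative subset).
SNOWBALL: Dict[str, str] = {"en": "English", "fr": "French", "fi": "Finnish"}
# Language code -> built-in analyzer used when no Snowball stemmer is listed.
FALLBACK: Dict[str, str] = {"zh": "cjk", "th": "thai"}


def build_analysis(languages: List[str]) -> Tuple[dict, Dict[str, str]]:
    """Return (index settings, language -> analyzer name) for the given languages."""
    filters: Dict[str, dict] = {}
    analyzers: Dict[str, dict] = {}
    chosen: Dict[str, str] = {}
    for lang in languages:
        if lang in SNOWBALL:
            # Custom analyzer: standard tokenizer, lowercase, Snowball stemming.
            filters[f"{lang}_snowball"] = {"type": "snowball", "language": SNOWBALL[lang]}
            analyzers[f"{lang}_text"] = {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", f"{lang}_snowball"],
            }
            chosen[lang] = f"{lang}_text"
        else:
            # No stemmer listed: use a built-in analyzer, or 'standard' as a last resort.
            chosen[lang] = FALLBACK.get(lang, "standard")
    return {"analysis": {"filter": filters, "analyzer": analyzers}}, chosen
```

The second return value can then be used to fill in the `analyzer` of each per-language field in the mapping.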

For the query part I can play both with a user setting and query
language identification (whenever possible).

You were really helpful !

Cheers!

Alex


(Jussi Arpalahti) #6

Hi.

Perhaps this is only relevant for a language like Finnish, but we have
indexed two versions of every field: one with no language analysis
for exact matches and the other with a language analyzer for expanded matches.

Documents are built like this:
title: "some text" -> to default analyzer
title_en: "some text" -> to english analyzer
language: english

When searching, the language-independent field is given a slight boost over
the linguistically analyzed field. Our documents only contain text in one
language and are indexed with a language field. Searches are then filtered
by this field on the user's chosen language.
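
A sketch of how such a search could be assembled, as a Python dict. The field names follow the document layout above (`title`, `title_en`, `language`); the suffix table, the boost value of 1.2 and the function name are assumptions, and the `bool`/`match`/`term` syntax follows recent Elasticsearch versions. The `term` filter also assumes that `language` is indexed as an exact, non-analyzed value.

```python
from typing import Dict

# Value of the 'language' field -> suffix of the analyzed field (assumed mapping).
FIELD_SUFFIX: Dict[str, str] = {"english": "en", "finnish": "fi", "swedish": "sv"}


def build_search(text: str, language: str, exact_boost: float = 1.2) -> dict:
    """Search the plain and the language-analyzed field, filtered by language."""
    analyzed_field = f"title_{FIELD_SUFFIX[language]}"
    return {
        "query": {
            "bool": {
                "should": [
                    # Plain field: slightly boosted so exact matches rank first.
                    {"match": {"title": {"query": text, "boost": exact_boost}}},
                    # Language-analyzed field: catches inflected forms.
                    {"match": {analyzed_field: {"query": text}}},
                ],
                "minimum_should_match": 1,
                # Only documents written in the user's chosen language.
                "filter": [{"term": {"language": language}}],
            }
        }
    }
```

A document that matches on both fields scores higher than one that matches only through the analyzed field, which is the effect of the slight boost described above.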

FYI, Finnish is a somewhat complex language to search. As Wikipedia
says, "it modifies and inflects the forms of nouns, adjectives, pronouns,
numerals and verbs, depending on their roles in the sentence." Thus
we need to index the word as it is in the document but also in its base
form, so the user does not have to match the dozens of variations of the word.
We have also used this indexing strategy for English and Swedish words.
However, we don't yet have enough experience with the search service to say
if this is a good approach for these languages.


(Otis Gospodnetić) #7

Hi Shay,

There is also the option of specifying an Analyzer for each field at
index-time, right?
Are there drawbacks to this approach that make you say that index
per language and field set per language are usually the best
options?

Thanks,
Otis




(Shay Banon) #8

It's just the fact that a single field will then have its terms produced by
different analyzers. It can certainly be used, but I would prefer to keep
them separate.
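
A toy illustration of that concern in plain Python, with no Elasticsearch involved. The two "analyzers" are deliberately crude suffix strippers invented for this example, not real stemmers. The point is only that once one field holds terms from different analyzers, a query finds a document only if it happens to be analyzed the same way that document was.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def strip_suffix(token: str, suffix: str) -> str:
    """Remove a suffix if present (crude stand-in for stemming)."""
    if token.endswith(suffix) and len(token) > len(suffix):
        return token[: -len(suffix)]
    return token


def toy_english(text: str) -> List[str]:
    return [strip_suffix(t, "s") for t in text.lower().split()]


def toy_german(text: str) -> List[str]:
    return [strip_suffix(t, "en") for t in text.lower().split()]


# One shared field: term -> ids of documents containing it.
index: Dict[str, Set[int]] = defaultdict(set)


def add(doc_id: int, text: str, analyzer: Callable[[str], List[str]]) -> None:
    for term in analyzer(text):
        index[term].add(doc_id)


def search(text: str, analyzer: Callable[[str], List[str]]) -> List[int]:
    hits: Set[int] = set()
    for term in analyzer(text):
        hits |= index.get(term, set())
    return sorted(hits)


add(1, "Airports", toy_english)   # indexed as 'airport'
add(2, "Flughäfen", toy_german)   # indexed as 'flughäf'
```

Here `search("Flughäfen", toy_german)` finds document 2, but `search("Flughäfen", toy_english)` finds nothing, because the query term is never reduced to the form that was indexed.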



(Mark Waddle) #9

I agree with Shay. I would think that having separate indices would be
better, especially if you ever plan to use the _all field
(http://www.elasticsearch.org/guide/reference/mapping/all-field.html).

