Multilingual field handling with multiple fields in ES

Hi all,

We have a document index with a number of fields such as 'title',
'description', 'searches' etc. These can all be provided in the input from
any locale, e.g. I could put in these 3 fields in English and someone else
could add them in French etc.

I know there have been lots of posts on the group regarding:

  1. Setting an index analyzer in advance to analyze 'all content' (won't
    work with multiple languages on input)
  2. Using multiple indexes - 1 for each language (would prefer not to have
    to do this from an index/alias maintenance point of view)
  3. Using multiple fields (for each language) and analyzing per field (e.g.
    have a field for title-fr, title-de, title-en and separate the data with a
    specific field analyzer for each field). This would have the overhead of
    having to explicitly create the mapping for (number of fields * number
    of languages you want to support) fields
  4. Using single fields (all languages in one) and analyzing per
    language (e.g. have 1 field and then just set the analyzer on indexing that
    piece of content). This seems like the cleanest solution but I'm wondering
    if there is any search/indexing issue with having multi-lingual terms
    within 1 field.
  5. Multi-field values (similar to point 3) but explicitly using multi-field
    instead of multiple separated fields.
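
To make the overhead in option 3 concrete, here is a minimal Python sketch that generates the per-language properties instead of writing them by hand. The field names come from this post; the language codes, the helper name per_language_mapping and the choice of the built-in "english", "french" and "german" analyzers are assumptions for illustration only.

```python
import json

# Assumed example values: field names are from the post, the language
# codes and analyzer choices are illustrative.
FIELDS = ["title", "description", "searches"]
ANALYZERS = {"en": "english", "fr": "french", "de": "german"}


def per_language_mapping(fields, analyzers):
    """Build one string property per (field, language) pair, e.g. title-fr."""
    properties = {}
    for field in fields:
        for code, analyzer in analyzers.items():
            properties[f"{field}-{code}"] = {"type": "string", "analyzer": analyzer}
    return {"properties": properties}


mapping = per_language_mapping(FIELDS, ANALYZERS)
# 3 fields x 3 languages gives 9 explicit properties to maintain.
print(len(mapping["properties"]))
print(json.dumps(mapping["properties"]["title-fr"]))
```

Every extra language adds one property per field, which is the maintenance cost that makes option 4 look attractive.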

I'm leaning towards number 4 but would appreciate any feedback people would
have from experience,

Thanks,

Derry

--

Hi Derry,

I'm facing the same problem but I'm leaning towards 5 simply because
there's only a handful of languages that we're using in practice. Can you
elaborate more on 4?

--

Hi,

I also have the same requirement in my project, so I agree with your point 4.
Have you tried this? Did it work for you?
I tried it with the Java API but was not able to get it working.
If you have done this, could you please guide me and provide some sample code
for doing it with the Java API?

Thanks

--
Regards
Sapana Patel

--

Sorry for the delay in getting back to both of you on this.

First off, you have to set up the analysis settings on your index so that it
knows which analyzer to use when it gets a certain language in:

curl -XPUT 'http://localhost:9200/data/_settings' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ar": {
          "type": "arabic"
        },
        "hy": {
          "type": "armenian"
        },
        "eu": {
          "type": "basque"
        }
        .....

This sets up the data index with one named analyzer per language code: when
it gets an input of "ar", it analyzes using the "arabic" analyzer, when it
gets "hy", it uses the "armenian" analyzer and so on...
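
If you have a lot of languages, the body above can be generated rather than typed out. A minimal Python sketch, assuming the same three code-to-analyzer pairs shown above (the helper name analysis_settings is made up for this example, and sending the body to ES is left out):

```python
import json

# The three pairs shown in the settings above; extend with your own languages.
LANGUAGE_ANALYZERS = {"ar": "arabic", "hy": "armenian", "eu": "basque"}


def analysis_settings(language_analyzers):
    """Build a settings body with one named analyzer per language code."""
    analyzers = {
        code: {"type": analyzer_type}
        for code, analyzer_type in language_analyzers.items()
    }
    return {"settings": {"analysis": {"analyzer": analyzers}}}


body = analysis_settings(LANGUAGE_ANALYZERS)
# Pretty-print the JSON that would be sent as the request body.
print(json.dumps(body, indent=2))
```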

I then need to specify in my index type which field I want to use as the
analyzer input. I do this when specifying the type mapping:

curl -XPUT 'http://localhost:9200/data/data_language/_mapping' -d '{
  "data_language": {
    "_analyzer": {
      "path": "language"
    },
    ...,
    "properties": {
      ...,
      "language": {
        "type": "string",
        "index": "not_analyzed"
      },
      ...
    }
  }
}'

So when a document is put into ES for the above type with a language like
"ar", the system automatically uses the "arabic" analyzer.
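
As a rough mental model only (this is not ES code), the lookup behaves like the Python sketch below: the value of the field named by "path" is used as the analyzer name. The sample documents, the helper name resolve_analyzer and the fallback name "default" for untagged documents are assumptions for illustration.

```python
# Toy model of the "_analyzer" path lookup; not Elasticsearch code.
def resolve_analyzer(doc, path="language", default="default"):
    """Return the analyzer name a document would select via its path field."""
    return doc.get(path, default)


# Hypothetical documents: two tagged with a language code, one untagged.
docs = [
    {"language": "ar", "title": "Arabic title text"},
    {"language": "eu", "title": "Kaixo mundua"},
    {"title": "Untagged title text"},
]

for doc in docs:
    print(doc["title"], "->", resolve_analyzer(doc))
```

So the language field has to be set on every document at index time, with a value that matches one of the analyzer names defined in the settings.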

The main issue with this is that once the content is inserted and analyzed,
that's it; there is no changing the analyzer language later without
reindexing the document. The advantage is that you can search over this data
in a language-agnostic way, or else you can specify the locale/language you
want to search in and only get results in your language...
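
The two search styles could look roughly like the Python sketch below. It only builds the request bodies; the helper name search_body and the choice of 'title' as the searched field are assumptions, and the "filtered" query form is the syntax from ES releases of this era, so check it against the version you run.

```python
import json


def search_body(text, field="title", language=None):
    """Build a search body, optionally restricted to one language."""
    match = {"match": {field: {"query": text}}}
    if language is None:
        # Language-agnostic: search the field across all documents.
        return {"query": match}
    # Analyze the query text with the same named analyzer used at index
    # time, and keep only documents tagged with that language code.
    match["match"][field]["analyzer"] = language
    return {
        "query": {
            "filtered": {
                "query": match,
                "filter": {"term": {"language": language}},
            }
        }
    }


print(json.dumps(search_body("report")))
print(json.dumps(search_body("report", language="ar")))
```

The term filter works on the language field as-is because it is mapped as not_analyzed.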

Like anything, it is all based on your workflow - test out the different
ways and figure out which works best :)

D
