Multilingual field handling with multiple fields in ES

Derry_O_Sullivan · November 28, 2012, 3:58pm

Hi all,

We have a document index with a number of fields such as 'title',
'description', 'searches' etc. The values can be supplied in any locale,
e.g. I could fill in these 3 fields in English and someone else could add
them in French.

I know there have been lots of posts on the group regarding:

1. Setting an index analyzer in advance to analyze 'all content' (won't
   work with multiple languages on input).
2. Using multiple indexes, 1 for each language (would prefer not to have
   to do this from an index/alias maintenance point of view).
3. Using multiple fields (one per language) and analyzing per field (e.g.
   have a field for title-fr, title-de, title-en and separate the data with a
   specific field analyzer for each field). This would have the overhead of
   having to explicitly create the mapping for the number of fields * the
   number of languages you want to support.
4. Using single fields (all languages in one) and analyzing per
   language (e.g. have 1 field and then just set the analyzer when indexing
   that piece of content). This seems like the cleanest solution but I'm
   wondering if there is any search/indexing issue with having multilingual
   terms within 1 field.
5. Multi-field values (similar to point 3) but explicitly using multi-field
   instead of multiple separate fields.
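
To make the mapping overhead of the multiple-fields option (the third in the list) concrete, here is a minimal sketch that generates one field per (field, language) pair. The language codes and the choice of the built-in `english`, `french` and `german` analyzers are assumptions for illustration, not something taken from our actual setup.

```python
import json

# Illustrative inputs only: field names come from the post, the language
# codes and analyzer choices are assumptions.
FIELDS = ["title", "description", "searches"]
ANALYZERS = {"en": "english", "fr": "french", "de": "german"}


def per_language_properties(fields, analyzers):
    """Build one string field per (field, language) pair, e.g. title-fr."""
    properties = {}
    for field in fields:
        for code, analyzer in sorted(analyzers.items()):
            properties["%s-%s" % (field, code)] = {
                "type": "string",
                "analyzer": analyzer,
            }
    return properties


properties = per_language_properties(FIELDS, ANALYZERS)
print(len(properties))  # number of fields * number of languages
print(json.dumps(properties["title-fr"], sort_keys=True))
```

Generating the mapping removes the typing, but every new language still adds one field per source field to the mapping.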

I'm leaning towards number 4 but would appreciate any feedback people
have from experience.

Thanks,

Derry

--

D11 · January 3, 2013, 10:18pm

Hi Derry,

I'm facing the same problem but I'm leaning towards 5 simply because
there's only a handful of languages that we're using in practice. Can you
elaborate more on 4?

--

Sapana_Patel · January 24, 2013, 11:06am

Hi,

I have the same requirement in my project, so I agree with your point 4.
Have you tried this? Did it work for you?
I tried it with the Java API but was not able to get it working.
If you have done this, could you please guide me and provide some sample
code for doing it with the Java API?

Thanks

--
Regards
Sapana Patel

--

Derry_O_Sullivan · February 4, 2013, 10:56am

Sorry for the delay in getting back to both of you on this.

First off, you have to define in your index settings which analyzer to use
when content in a certain language comes in:

curl -XPUT 'http://localhost:9200/data/_settings' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ar": {
          "type": "arabic"
        },
        "hy": {
          "type": "armenian"
        },
        "eu": {
          "type": "basque"
        }
        .....

This sets up the data index with analyzers named after language codes (3
are shown here): when it gets an input of "ar", it analyzes using the
"arabic" analyzer, when it gets "hy", it uses the "armenian" analyzer, and
so on.
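
If the list of languages grows, the settings body can be generated rather than typed out. This is only a sketch: it covers the three analyzers shown above, and `analysis_settings` is a hypothetical helper name, not part of any Elasticsearch client.

```python
import json

# Language code -> built-in analyzer type, limited to the three shown above.
LANGUAGE_ANALYZERS = {"ar": "arabic", "hy": "armenian", "eu": "basque"}


def analysis_settings(language_analyzers):
    """Return a settings body with one analyzer named after each language code."""
    analyzers = {
        code: {"type": analyzer_type}
        for code, analyzer_type in language_analyzers.items()
    }
    return {"settings": {"analysis": {"analyzer": analyzers}}}


body = analysis_settings(LANGUAGE_ANALYZERS)
print(json.dumps(body, indent=2, sort_keys=True))
```

The printed JSON is what would go after `-d` in the curl call.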

I then need to specify in my index type which field to use as the analyzer
input. I do this when specifying the type mapping:

curl -XPUT 'http://localhost:9200/data/data_language/_mapping' -d '{
  "data_language": {
    "_analyzer": {
      "path": "language"
    },
    ...,
    "properties": {
      ...,
      "language": {
        "type": "string",
        "index": "not_analyzed"
      },
      ...
    }
  }
}'
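
A document indexed against this mapping names its analyzer through its `language` field, so the value has to match one of the analyzer names from the settings. As a rough sketch, a guard like the one below could run before indexing. `build_document` and the `title`/`description` values are made up for illustration, and what Elasticsearch does with an unregistered name is not something I have covered here, hence the check.

```python
import json

# Analyzer names registered in the index settings (the three shown earlier).
CONFIGURED_ANALYZERS = {"ar", "hy", "eu"}


def build_document(title, description, language):
    """Return a document body whose `language` value names a configured analyzer."""
    if language not in CONFIGURED_ANALYZERS:
        raise ValueError("no analyzer configured for language %r" % language)
    return {"title": title, "description": description, "language": language}


doc = build_document("example title", "example description", "ar")
print(json.dumps(doc, sort_keys=True))
```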

So when a document of the above type is put into ES with a language like
"ar", the system automatically uses the "arabic" analyzer.

The main issue with this is that once the content is inserted and analyzed,
that's it; there is no changing the analyzer language later. The advantage
is that you can search over this data either in a language-agnostic way, or
you can specify the locale/language you want to search in and only get
results in your language.
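
The two ways of searching could look roughly like the request bodies built below. This is a sketch under assumptions: `search_body` is a hypothetical helper, the `title` field and the search text are placeholders, and the `filtered` query is the form available in Elasticsearch releases of this period (later releases use a `bool` query with a `filter` clause instead).

```python
import json


def search_body(text, field="title", language=None):
    """Build a search body; restrict to one language when `language` is given."""
    if language is None:
        # Language-agnostic: match the field across every document.
        return {"query": {"match": {field: text}}}
    # Language-specific: analyze the query text with that language's analyzer
    # and keep only documents whose `language` field has the same value.
    return {
        "query": {
            "filtered": {
                "query": {"match": {field: {"query": text, "analyzer": language}}},
                "filter": {"term": {"language": language}},
            }
        }
    }


any_language = search_body("library")
arabic_only = search_body("library", language="ar")
print(json.dumps(any_language, sort_keys=True))
print(json.dumps(arabic_only, sort_keys=True))
```

The term filter works on `language` because that field is mapped as `not_analyzed`, so the stored value is the exact code.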

Like anything, it all depends on your workflow: test out the different
ways and figure out which works best.

D

