Hi everyone - We have been exploring what it would take to implement a
multilingual search solution using Elasticsearch. If anyone has already
done this, it would be great to hear about your experience. A few
questions I have:
Would one have to create a separate index for each language?
If the indexes are separate, would their mappings be very different?
Would the queries have to be constructed differently for different
languages?
I am keen to hear the key considerations to take into account
when thinking about implementing it.
You do not need separate indexes. Language can be handled per field (or
even mixed into a single field).
You can assign each field a different analyzer. If you use index types,
there is nothing to prevent you from setting up a field "content" and
assigning the english analyzer to it in index type "english", the german
analyzer in index type "german", the french analyzer in index type
"french", and so on. You can also use minimal or no stemming at all and
use a single field for all languages. That can be useful if you do not
know which languages you have to index. In that case you can also use the
langdetect plugin and store the detected language code in the document to
use as a search filter. This depends entirely on your requirements.
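To make the per-field option above concrete, here is a minimal sketch in Python that only builds the request bodies as plain dictionaries. It assumes a recent Elasticsearch version with typeless mappings, so it uses one field per language instead of index types. The field names ("content_en", "lang", and so on) and the helper functions are made up for illustration; "english", "german", "french" and "standard" are built-in analyzers.

```python
import json

# Built-in Elasticsearch language analyzers, keyed by language code.
# Which languages to support is an assumption of this sketch.
LANGUAGE_ANALYZERS = {"en": "english", "de": "german", "fr": "french"}


def build_mapping(analyzers=LANGUAGE_ANALYZERS):
    """Return an index body with one text field per language plus a fallback."""
    properties = {
        # Language code stored with each document, usable as a search filter.
        "lang": {"type": "keyword"},
        # Fallback field without stemming, for unknown languages.
        "content": {"type": "text", "analyzer": "standard"},
    }
    for code, analyzer in analyzers.items():
        properties[f"content_{code}"] = {"type": "text", "analyzer": analyzer}
    return {"mappings": {"properties": properties}}


def field_for(lang):
    """Pick the field that holds text in the given language."""
    return f"content_{lang}" if lang in LANGUAGE_ANALYZERS else "content"


def to_document(text, lang):
    """Route the text into the field whose analyzer matches its language."""
    return {"lang": lang, field_for(lang): text}


def build_query(text, lang):
    """Match on the language-specific field and filter on the language code."""
    return {
        "query": {
            "bool": {
                "must": {"match": {field_for(lang): text}},
                "filter": {"term": {"lang": lang}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(to_document("Der schnelle braune Fuchs", "de")))
    print(json.dumps(build_query("Fuchs", "de")))
```

The language code passed to to_document would come from whatever detection you use (for example the langdetect plugin mentioned above); the sketch does not perform any detection itself.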
No, the queries do not have to be constructed differently for each language.
You do not mention the biggest challenge for multilingual search:
language-independent normalization and case folding for robust search. The
ICU analysis plugin is very valuable for this.
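As a rough illustration of what normalization and case folding mean here, the Python sketch below does two things. It builds index settings for a custom analyzer that uses the ICU plugin's icu_tokenizer and icu_folding components (the analyzer name "folded" is made up, and the analysis-icu plugin must be installed for these settings to work). It also shows a standard-library approximation of folding; this is only an approximation for demonstration, not what ICU does internally.

```python
import unicodedata

# Index settings for a language-independent analyzer. "icu_tokenizer" and
# "icu_folding" are provided by the analysis-icu plugin; the analyzer name
# "folded" is a hypothetical choice for this sketch.
ICU_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folded": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding"],
                }
            }
        }
    }
}


def fold(text):
    """Approximate folding: case-fold, decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Differently cased and accented spellings end up as the same term.
    print(fold("CAFÉ"), fold("café"), fold("Cafe"))
```

With folding applied at both index and search time, a query for "cafe" can match documents containing "Café" regardless of the document's language.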
It all really depends on your content, your business requirements,
and the amount of data you have.
For us it makes sense to have everything in one index but use a different
analyzer for each document, based on the main language detected in the
text. But that is just our way of doing it.