Multi-language analyzers in Elasticsearch


(SR) #1

Hi ES Team,

We are building a real-time search system for one of our applications. Our customer base is huge and so is the data: we currently have around 13 TB, and it is expected to grow. Customers can search in any language. As of now, we offer search in 8 languages, and that number will not grow often.
Given this, we have an index with many fields, two of which are language dependent and must be searchable in all languages. Currently, all documents are indexed in English. To enable multiple languages on these two fields, we have two options:

  1. Use one language per field. The problem here is that re-indexing the existing data (in the huge index) could become cumbersome as the index grows.
  2. Have two indexes. One index would be used purely for translation: searching it with a foreign word returns the translated English word. With this English word, we then search the existing index to get the customer data in English.

Which approach would be better and simpler, given that the index is expected to grow? Any help would be appreciated!


(Ryan Ernst) #2

The first approach (one language per field) is going to yield better results, since query analysis will happen within that native language. Translating will probably cause issues, since a translation usually does not carry exactly the same meaning as the original phrasing.
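
As a rough illustration of the one-language-per-field layout, here is a minimal Python sketch that only builds the request bodies as plain dicts. The field name `title`, the helper names, and the choice of four languages are hypothetical; `english`, `spanish`, `french`, and `german` are built-in Elasticsearch language analyzers. The mapping shape assumes the typeless mappings used by recent Elasticsearch versions, so older clusters may need a type level.

```python
# Hypothetical: "title" stands in for one of the two language-dependent fields.
# Keys are language codes used as field suffixes; values are built-in analyzers.
LANGUAGES = {"en": "english", "es": "spanish", "fr": "french", "de": "german"}


def language_field_mapping(base_field, languages):
    """Build an index body with one text field per language, e.g. title_es."""
    properties = {
        f"{base_field}_{code}": {"type": "text", "analyzer": analyzer}
        for code, analyzer in languages.items()
    }
    return {"mappings": {"properties": properties}}


def language_query(base_field, code, text):
    """Build a match query against the field for the searcher's language."""
    return {"query": {"match": {f"{base_field}_{code}": text}}}


mapping = language_field_mapping("title", LANGUAGES)
query = language_query("title", "es", "gatos")
```

Because a match query is analyzed with the analyzer of the field it targets, the Spanish query text is processed by the same `spanish` analyzer that was used for `title_es` at index time.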


(SR) #3

Explanation of the second approach:
I meant that I will have an index called "translateIndex" which has data in this format. This is just an example (it is incomplete):
{
  "gato_es": "cat",
  "chat_fr": "cat",
  "katze_de": "cat"
}
Basically, it is an index with a field mapping each foreign word to its English word, with the language analyzers applied on each field.
Now when a customer searches for a word, say "gatos" in Spanish (a plural), we first look that word up in this index in ES, and after the analyzers are applied the result would be {gato_es: cat}.
With this English word "cat", I will then search my "mainIndex" (this index has all data in English).
So basically, the "translateIndex" is used to apply language analyzers on the fields and give me an English word.
To summarize: I would search ES twice every time a user searches for something, once on the "translateIndex" and once on the "mainIndex".
Do you see any problem here?
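
The two-step flow could be sketched roughly as follows in Python. Several things here are assumptions rather than part of the design above: the translation index is taken to hold one document per word with one field per language code (for example `{"en": "cat", "es": "gato"}`), the main field is called `description`, and `search` is any callable taking an index name and a request body, standing in for a real client. Since Elasticsearch index names must be lowercase, the sketch uses `translate_index` and `main_index`.

```python
def translate_query(lang, text):
    """First lookup: match the foreign word in its language field, return only "en"."""
    return {"query": {"match": {lang: text}}, "_source": ["en"], "size": 1}


def main_query(field, english_word):
    """Second lookup: search the English-only main index."""
    return {"query": {"match": {field: english_word}}}


def two_step_search(search, lang, text, field="description"):
    """Run both lookups; search(index, body) returns an Elasticsearch-style dict."""
    hits = search("translate_index", translate_query(lang, text))["hits"]["hits"]
    if not hits:
        return []  # no translation found, so nothing to search for
    english = hits[0]["_source"]["en"]
    return search("main_index", main_query(field, english))["hits"]["hits"]


def fake_search(index, body):
    """Canned responses so the flow can be exercised without a cluster."""
    if index == "translate_index":
        return {"hits": {"hits": [{"_source": {"en": "cat"}}]}}
    return {"hits": {"hits": [{"_id": "1", "_source": {"description": "a cat"}}]}}


results = two_step_search(fake_search, "es", "gatos")
```

Every user search costs two round trips, and a multi-word query would need each term translated before the second lookup.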

