Indexing multi-language documents with langdetect

Suppose I want to index my mailbox. Every email has a "Body" section, but not all of my emails are in the same language. For example, the first email might be in English (so its "Body" field is in English), the second in Polish, and the third in Japanese (just an example, I don't know that many languages).

What I want is to index them all in a single ES index and still get good search results. For that I need to apply a different analyzer for each language. The Elasticsearch documentation suggests a field-per-language approach. The problem is that at indexing time I don't know which language the "Body" field is written in.
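To make the field-per-language idea concrete, here is a minimal sketch of the kind of mapping I have in mind. It only builds the request body as a Python dict and does not talk to a cluster. The language codes and the helper name `build_mapping` are my own choices for illustration. It assumes the typeless mapping format of ES 7 and later, and that the `polish` and `kuromoji` analyzers are available, which as far as I know requires the analysis-stempel and analysis-kuromoji plugins.

```python
import json

# Assumption: one sub-field per language, keyed by a language code.
# "english" is a built-in analyzer; "polish" and "kuromoji" come from
# the analysis-stempel and analysis-kuromoji plugins respectively.
ANALYZERS = {"en": "english", "pl": "polish", "ja": "kuromoji"}


def build_mapping(field, analyzers):
    """Build an index body where `field` is an object with one text
    sub-field per language, each using its own analyzer."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "properties": {
                        lang: {"type": "text", "analyzer": analyzer}
                        for lang, analyzer in analyzers.items()
                    }
                }
            }
        }
    }


mapping = build_mapping("Body", ANALYZERS)
print(json.dumps(mapping, indent=2))
```

With a mapping like this, an English email would be indexed as `{"Body": {"en": "..."}}`, so only the "Body.en" field is populated for that document.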

I found this library for langdetect support: It seems to do just what I want. At the moment of indexing a document, it detects the language and stores the text in the appropriate field (e.g. an email with an English "Body" is stored in the "Body.en" field, and an email with a Japanese "Body" is stored in the "Body.ja" field). With this approach I get all the advantages of the field-per-language approach without having to detect the "Body" language myself.

However, the main problem with the library above is that (according to its README, issues, and commit history) it is outdated and no longer maintained. The last supported ES version is 5.4.

I know that there is also this library out there: But it does not seem to support the behavior I need, i.e. "detect the language of the SomeField field and, if it is English, store its content in the SomeField.en field". I could use this library to detect the language of a message explicitly before indexing it, by making a request to ES and reading the "language" field from the result. However, that adds an extra request per message, which seems to cancel out the advantages of the bulk API (I currently index my emails in batches), and the bulk API can give a significant performance boost.
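For comparison, here is a rough sketch of what detecting the language in the indexing script itself could look like, so that each batch still goes out as a single bulk request. This is only an illustration under assumptions: `detect_language` is a toy placeholder based on character ranges, not a real detector and not the API of either library mentioned above, and the index name `emails` is made up. The output is the newline-delimited body that the bulk API expects (one action line followed by one source line per document).

```python
import json

POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")


def detect_language(text):
    """Toy stand-in for a real language detector (assumption, for
    illustration only): looks at characters, nothing more."""
    for ch in text:
        code = ord(ch)
        # Hiragana/Katakana or common CJK ideographs -> treat as Japanese.
        if 0x3040 <= code <= 0x30FF or 0x4E00 <= code <= 0x9FFF:
            return "ja"
    if any(ch in POLISH_LETTERS for ch in text):
        return "pl"
    return "en"


def to_bulk_body(index, emails):
    """Turn (id, body) pairs into an NDJSON bulk payload, routing each
    body into the Body.<lang> field chosen by detect_language."""
    lines = []
    for email_id, body in emails:
        lang = detect_language(body)
        lines.append(json.dumps({"index": {"_index": index, "_id": email_id}}))
        lines.append(json.dumps({"Body": {lang: body}}, ensure_ascii=False))
    # The bulk API requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


emails = [
    ("1", "Hello, see you tomorrow"),
    ("2", "Dziękuję za wiadomość"),
    ("3", "こんにちは"),
]
payload = to_bulk_body("emails", emails)
print(payload)
```

The trade-off is that detection quality then depends entirely on whatever replaces the placeholder function, and the detection logic lives outside of ES.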
