One language per field vs. multi-fields for a large number of supported languages


(George P. Stathis) #1

We are currently using the one language per field approach (https://www.elastic.co/guide/en/elasticsearch/guide/current/one-lang-fields.html) to support about ten different languages. We don't send the same content to all the fields. We instead detect the language before indexing and then select which field to send the content to.
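For readers following along, here is a minimal sketch of the routing described above. The field names (`content_fr`, `content_default`), the language codes, and the `"text"` field type are assumptions for illustration (older Elasticsearch versions use `"string"`), and language detection is assumed to have already happened upstream. The analyzer names `english`, `french`, and `dutch` are built-in Elasticsearch language analyzers.

```python
# Hypothetical map of detected language code -> built-in Elasticsearch analyzer.
ANALYZERS = {"en": "english", "fr": "french", "nl": "dutch"}


def one_field_per_language_mapping(analyzers=ANALYZERS, base="content"):
    """Build a mapping with one analyzed field per supported language."""
    properties = {
        f"{base}_{code}": {"type": "text", "analyzer": name}
        for code, name in analyzers.items()
    }
    # Fallback field (an assumption) for content whose language is unsupported.
    properties[f"{base}_default"] = {"type": "text", "analyzer": "standard"}
    return {"properties": properties}


def route_document(text, detected_lang, analyzers=ANALYZERS, base="content"):
    """Return a document body that populates only the matching language field."""
    suffix = detected_lang if detected_lang in analyzers else "default"
    return {f"{base}_{suffix}": text}


mapping = one_field_per_language_mapping()
doc = route_document("Bonjour tout le monde", "fr")
print(doc)
```

Each document carries exactly one populated field, so it is analyzed once, by the analyzer for its detected language.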

It's been suggested that we consider using multi-fields to do this (https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html#_analyze_multiple_times) to reduce the amount of data we send to the index.

It seems to me that with the number of languages we have now (soon to be doubled to 20), the multi-field approach might be more wasteful than the one-field-per-language one. We might be sending the content only once, but we would be needlessly putting it through a lot more analyzers than we do now. E.g. why would I analyze French content with a Dutch analyzer if I already know the content is in French? Wouldn't we be creating a lot more tokens than necessary?
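To make the cost concern concrete, here is a rough sketch of what a multi-field mapping for the same content could look like, with a helper that counts how many times each document would be analyzed. The field name `content`, the sub-field names, and the `"text"` type are assumptions for illustration. The count treats the root field as one additional pass, since a text field with no explicit analyzer falls back to the standard analyzer.

```python
def multi_field_mapping(analyzers, field="content"):
    """Build a mapping where one source field is re-analyzed once per language."""
    sub_fields = {
        code: {"type": "text", "analyzer": name}
        for code, name in analyzers.items()
    }
    return {"properties": {field: {"type": "text", "fields": sub_fields}}}


def analysis_passes_per_document(mapping, field="content"):
    """Count analysis passes: the root field plus every language sub-field."""
    return 1 + len(mapping["properties"][field]["fields"])


# Hypothetical three-language setup; the real deployment would list all of them.
analyzers = {"en": "english", "fr": "french", "nl": "dutch"}
mapping = multi_field_mapping(analyzers)
passes = analysis_passes_per_document(mapping)
print(passes)
```

Under these assumptions, 20 languages would mean 21 analysis passes and 21 sets of indexed tokens per document, compared with a single pass when the language is detected first and only the matching field is populated.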

I'm thinking multi-fields might not be the right call here, but I'm looking for a sanity check.

