Multi-language content

Lauxley · November 18, 2019, 11:06am

Hello, I'm trying to figure out the best way to index multi-language content when there is no definitive list of supported languages: the list is dynamic and will grow quickly depending on user input (we work with historical data, so potentially hundreds of different languages).
By multi-language I mean that a single text can contain several languages at once, up to 5 in our current use cases. Texts are also sometimes translated, but that's a non-issue.
Since search quality is critical to the project, we don't want to rely only on a generic trigram analyzer. The only good news is that we can ask the user for the content languages, so at least they are known before indexing.
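To make the "languages are known before indexing" point concrete, here is a minimal sketch of picking analyzers from the user-declared languages, with a trigram fallback for anything without a dedicated analyzer. The language codes, the `LANGUAGE_ANALYZERS` table, the `trigrams` analyzer name and the `analyzers_for` helper are all assumptions for illustration; only `english`, `french` and `german` are built-in Elasticsearch analyzers, and the trigram definition is one common way to build it with an `ngram` token filter.

```python
# Hypothetical table: user-supplied language code -> analyzer name.
# "english", "french" and "german" are built-in language analyzers;
# "trigrams" is a custom analyzer defined in INDEX_SETTINGS below.
LANGUAGE_ANALYZERS = {"en": "english", "fr": "french", "de": "german"}
FALLBACK_ANALYZER = "trigrams"

# Index settings for the fallback analyzer (an assumed definition:
# standard tokenizer, lowercase, then 3-character n-grams).
INDEX_SETTINGS = {
    "analysis": {
        "filter": {
            "trigrams_filter": {"type": "ngram", "min_gram": 3, "max_gram": 3}
        },
        "analyzer": {
            "trigrams": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "trigrams_filter"],
            }
        },
    }
}


def analyzers_for(languages):
    """Return the distinct analyzers needed for the given language codes.

    Order of first appearance is preserved; unknown codes map to the fallback.
    """
    chosen = []
    for code in languages:
        name = LANGUAGE_ANALYZERS.get(code.lower(), FALLBACK_ANALYZER)
        if name not in chosen:
            chosen.append(name)
    return chosen
```

For a text declared as English, Latin, French and Ancient Greek, `analyzers_for(["en", "la", "FR", "grc"])` would select `english`, `trigrams` and `french`, so both unsupported languages share the single fallback.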

If I understand https://www.elastic.co/guide/en/elasticsearch/guide/current/language-pitfalls.html and https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html correctly, we have two options:

1. Add a subfield for each unique analyzer (around 25 for now), including the fallback analyzer (trigram).
   Cons: the content will always be analyzed 25+ times, which seems extremely inefficient. I'm also not sure how scoring works in that case.
   Pros: we can query one field and not care about the language of the query (?)
2. Use a different field for each unique analyzer.
   Cons: the app code is going to be really ugly, in both models and search, and it may force us to use _all (?)
   Pros: we can fill only the relevant language fields, which should make indexing much faster.
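For comparison, the two options could be sketched as index mappings like the ones below. This is only an illustration under assumptions: the field name `content`, the `content_<analyzer>` naming scheme, the analyzer subset and all helper functions are hypothetical, and `trigrams` is assumed to be a custom analyzer defined in the index settings.

```python
# Hypothetical subset of the ~25 analyzers mentioned above.
ANALYZERS = ["english", "french", "german", "trigrams"]


def subfield_mapping(analyzers):
    """Option 1: a single 'content' field with one sub-field per analyzer.

    Every document is analyzed once per sub-field, whatever its languages.
    """
    return {
        "properties": {
            "content": {
                "type": "text",
                "fields": {
                    name: {"type": "text", "analyzer": name} for name in analyzers
                },
            }
        }
    }


def per_field_mapping(analyzers):
    """Option 2: one top-level field per analyzer."""
    return {
        "properties": {
            f"content_{name}": {"type": "text", "analyzer": name}
            for name in analyzers
        }
    }


def build_document(text, doc_analyzers):
    """Option 2: fill only the fields matching the document's languages."""
    return {f"content_{name}": text for name in doc_analyzers}


def build_query(query_text, fields):
    """Search several fields at once and combine their scores."""
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "type": "most_fields",
                "fields": fields,
            }
        }
    }
```

With option 2, a French and English text would be indexed as `build_document(text, ["french", "english"])`, leaving the other fields empty. On the search side, `multi_match` accepts wildcard field patterns, so something like `build_query("liberté", ["content_*"])` might keep the query code from having to list every language field explicitly.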

Note: I didn't list separate indices per language as an option, since that doesn't work well with multiple languages in a single text.

Thank you.

