Hello, I'm trying to figure out the best way to index multi-language content when there is no fixed list of supported languages: the list is dynamic and will grow quickly depending on user input (we work with historical data, so potentially hundreds of different languages).
By multi-language I mean that a single text can contain several languages at once (up to 5 in our current use cases). The texts are also sometimes translated, but that's a non-issue.
Since search quality is critical to the project, we don't want to rely solely on a generic trigram analyzer. The only good news is that we can ask the user for the content's languages, so at least they are known before indexing.
If I understand https://www.elastic.co/guide/en/elasticsearch/guide/current/language-pitfalls.html and https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html correctly, we have two options:
- Add a subfield for each unique analyzer (so around 25 for now), including the fallback analyzer (trigram).
cons: the content will always be analyzed 25+ times, which seems extremely inefficient. I'm also not sure how scoring works in that case.
pros: we can query a single field without caring about the language of the query (?)
- A different field for each unique analyzer
cons: the app code is going to be really ugly, in both the models and the search, and it may force us to use _all (?)
pros: we can fill only the fields for the relevant languages, which should make indexing much faster
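To make the two options more concrete, here is a rough Python sketch of how I picture them. It only builds plain dicts (no client calls), and several things are assumptions for illustration: the field name `content`, the `content_<analyzer>` naming, the small `ANALYZERS` table, the custom `trigrams` analyzer definition, the choice to always fill the fallback field, and the use of the `text` field type (the linked guide predates it and uses `string`).

```python
# Hypothetical language-code -> analyzer table. "english", "french" and
# "german" are built-in Elasticsearch analyzers; "trigrams" is a custom
# analyzer defined in SETTINGS below.
ANALYZERS = {"en": "english", "fr": "french", "de": "german"}
FALLBACK = "trigrams"

# Index settings defining the fallback trigram analyzer (an assumed definition).
SETTINGS = {
    "analysis": {
        "filter": {
            "trigram_filter": {"type": "ngram", "min_gram": 3, "max_gram": 3}
        },
        "analyzer": {
            FALLBACK: {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "trigram_filter"],
            }
        },
    }
}


def subfield_mapping():
    """Option 1: a single 'content' field with one sub-field per analyzer.

    Every document is analyzed by all of these analyzers at index time.
    """
    fields = {
        name: {"type": "text", "analyzer": name}
        for name in sorted(set(ANALYZERS.values()))
    }
    content = {"type": "text", "analyzer": FALLBACK, "fields": fields}
    return {"properties": {"content": content}}


def per_language_mapping():
    """Option 2: one top-level field per analyzer, plus the fallback field."""
    names = sorted(set(ANALYZERS.values())) + [FALLBACK]
    return {
        "properties": {
            f"content_{name}": {"type": "text", "analyzer": name}
            for name in names
        }
    }


def build_document(text, languages):
    """Option 2: fill only the fields for the languages the user declared.

    The fallback field is always filled (an assumption), so languages
    without a dedicated analyzer are still searchable.
    """
    doc = {f"content_{FALLBACK}": text}
    for code in languages:
        analyzer = ANALYZERS.get(code)
        if analyzer is not None:
            doc[f"content_{analyzer}"] = text
    return doc


def build_query(text):
    """Option 2: search all language fields at once via a field wildcard."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": ["content_*"],
            }
        }
    }
```

With this, `build_document("Bonjour, hello", ["fr", "en"])` would fill only `content_french`, `content_english` and `content_trigrams`, and the wildcard in `build_query` might keep the search code from having to list every language field by hand.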
Note: I didn't list separate indices per language as an option, since that doesn't work well with multiple languages in a single text.