MultiLingual Index


(Hash Include) #1

Hi All

I have a document corpus with a few documents in Chinese, a few in German, and the rest in English. I cannot create a separate index per language because of our current infrastructure.

I need a single index with multiple analyzers on its fields.
My current thoughts:
If there is a field "title", then we need title.german (german analyzer), title.chinese (cjk analyzer), title.english (english analyzer), and title.general (standard analyzer). But with this approach every document is analyzed by every analyzer, which bloats the index size and increases indexing time. Is there a way to apply a specific analyzer to each document based on a language field?
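
A minimal sketch of the mapping described above, written as a Python dict so it can be sent as the body of a mapping request. The helper name build_title_mapping is hypothetical, and the "text" field type is an assumption (older Elasticsearch versions use "string"); german, cjk, english and standard are built-in analyzer names.

```python
# Sketch of the "every analyzer on every document" mapping.
# Assumption: "text" is the field type (older Elasticsearch versions use "string").
SUBFIELD_ANALYZERS = {
    "german": "german",
    "chinese": "cjk",
    "english": "english",
    "general": "standard",
}


def build_title_mapping(subfield_analyzers=SUBFIELD_ANALYZERS):
    """Return a mapping where title has one sub-field per analyzer."""
    return {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    # e.g. title.chinese is analyzed with the cjk analyzer
                    name: {"type": "text", "analyzer": analyzer}
                    for name, analyzer in subfield_analyzers.items()
                },
            }
        }
    }


mapping = build_title_mapping()
```

Every document indexed with this mapping has its title analyzed four times, which is the index bloat in question.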

I am also looking into ICU folding and other aspects of multilingual search. Please guide me in this regard.

Thanks
Sri Harsha


(Loren Siebert) #2

I ran into the same problem. The way I went about it was based on the One Language per Field approach. I use a custom serializer to look at the document language at index time and then copy the title field over to a title_#{language} field. So a French document would end up with title and title_fr fields. I set up the index template to use a French analyzer for *_fr fields, and so on. For search, I use both fields to influence the score.

It's a similar approach to what you were suggesting, but here you only end up with one extra field per document.

If it's helpful, the code that handles all of this is part of this project. Relevant files are here, here, and here.
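
Here is a minimal sketch of the one-extra-field idea in Python. It is not the code from the project mentioned above. All names (add_language_field, dynamic_templates_for, search_query, the "language" key, the LANGUAGE_ANALYZERS table) are hypothetical, and the "text" field type is an assumption (older Elasticsearch versions use "string").

```python
# Hypothetical names throughout; the language codes and their analyzers
# are an illustrative choice, not taken from the thread.
LANGUAGE_ANALYZERS = {"fr": "french", "de": "german", "en": "english", "zh": "cjk"}


def add_language_field(doc, field="title", language_key="language"):
    """Copy doc[field] into doc[field + "_" + language] for known languages."""
    out = dict(doc)
    lang = doc.get(language_key)
    if lang in LANGUAGE_ANALYZERS and field in doc:
        out[f"{field}_{lang}"] = doc[field]
    return out


def dynamic_templates_for(language_analyzers=LANGUAGE_ANALYZERS):
    """Build dynamic templates that give *_<lang> fields their analyzer."""
    return [
        {
            f"{lang}_fields": {
                "match": f"*_{lang}",
                "mapping": {"type": "text", "analyzer": analyzer},
            }
        }
        for lang, analyzer in language_analyzers.items()
    ]


def search_query(text, lang, field="title"):
    """Query the generic field and the language-specific one together."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [field, f"{field}_{lang}"],
                "type": "most_fields",
            }
        }
    }


french_doc = add_language_field({"title": "Bonjour le monde", "language": "fr"})
```

Here french_doc keeps its original title and gains a title_fr copy, so each document carries only one extra analyzed field.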


(Jörg Prante) #3

Not any more.

In 1.x, you could select the analyzer from a path in the document (the _analyzer mapping field). So you could index a language code derived from your input, and the analyzer would automatically be set to german, english, chinese, or whatever you needed.

In 2.x this feature was removed.

https://www.elastic.co/guide/en/elasticsearch/reference/2.1/mapping-analyzer-field.html

Maybe I can find a trick to implement this again in my language detection plugin.

