MultiLingual Index

hash_include · December 30, 2015, 5:26am

Hi All

I have a document corpus with a few documents in Chinese, a few in German, and the rest in English. I cannot create separate indexes per language because of our current infrastructure.

I need to have one index with multiple analyzers on fields.
My current thoughts:
If there is a field "title", then we need title.german (german analyzer), title.chinese (cjk analyzer), title.english (english analyzer), and title.general (standard analyzer). But with this approach every document is analyzed by every analyzer, which bloats both the index size and the indexing time. Is there a way to apply a specific analyzer to each document based on a language field?
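
For illustration, the multi-field layout described here could be sketched as the mapping below. This is only a sketch: the helper name `build_title_mapping` is made up for the example, and the field type is left as a parameter because it is "string" on the 1.x/2.x releases discussed in this thread and "text" on later ones. The analyzer names (german, cjk, english, standard) are the built-in ones.

```python
import json


def build_title_mapping(field_type="string"):
    """Build a mapping where `title` has one sub-field per analyzer.

    Hypothetical helper for illustration only. Every document is analyzed
    by every sub-field's analyzer, which is the index bloat described above.
    """
    # Sub-field name -> built-in analyzer name.
    analyzers = {
        "german": "german",
        "chinese": "cjk",
        "english": "english",
        "general": "standard",
    }
    return {
        "properties": {
            "title": {
                "type": field_type,
                "fields": {
                    name: {"type": field_type, "analyzer": analyzer}
                    for name, analyzer in analyzers.items()
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the mapping.
    print(json.dumps(build_title_mapping(), indent=2))
```

With this layout, title.german, title.chinese, title.english, and title.general are all populated for every document, whatever its language.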

I am also looking into ICU folding and other aspects of multilingual search. Please guide me in this regard.

Thanks
Sri Harsha

loren · December 30, 2015, 6:55pm

I ran into the same problem. The way I went about it was based on the One Language per Field approach. I use a custom serializer to look at the document language at index time and then copy the title field over to a title_#{language} field. So a French document would end up with title and title_fr fields. I set up the index template to use a French analyzer for *_fr fields, and so on. For search, I use both fields to influence the score.

It's a similar approach to what you were suggesting, but here you only end up with one extra field per document.
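
A rough sketch of that flow, in Python rather than the project's own code: copy the title into a per-language field at index time, map `*_fr`-style fields to a language analyzer with dynamic templates, and search both fields. All function names, the `language` field, and the language-code table are assumptions made for the example; `dynamic_templates`, `match`, and `multi_match` with `most_fields` are standard Elasticsearch features, and the field type is "string" on 1.x/2.x ("text" on later releases).

```python
# Assumed table of supported language codes -> built-in analyzer names.
LANGUAGE_ANALYZERS = {"fr": "french", "de": "german", "en": "english"}


def add_language_field(doc, field="title", lang_field="language"):
    """Return a copy of doc with e.g. title_fr added for a French document."""
    out = dict(doc)
    lang = out.get(lang_field)
    if lang in LANGUAGE_ANALYZERS and field in out:
        out[f"{field}_{lang}"] = out[field]
    return out


def build_dynamic_templates(field_type="string"):
    """One dynamic template per language, matching fields named *_<lang>."""
    return [
        {
            f"{lang}_fields": {
                "match": f"*_{lang}",
                "mapping": {"type": field_type, "analyzer": analyzer},
            }
        }
        for lang, analyzer in sorted(LANGUAGE_ANALYZERS.items())
    ]


def build_title_query(text, lang):
    """Query the plain title plus the language-specific copy, if any."""
    fields = ["title"]
    if lang in LANGUAGE_ANALYZERS:
        fields.append(f"title_{lang}")
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                # most_fields adds up the scores from every matching field.
                "type": "most_fields",
            }
        }
    }


if __name__ == "__main__":
    print(add_language_field({"title": "Bonjour le monde", "language": "fr"}))
    print(build_title_query("bonjour", "fr"))
```

Documents in an unlisted language keep only the plain title field, so they fall back to whatever analyzer that field uses.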

If it's helpful, the code that handles all of this is part of this project. Relevant files are here, here, and here.

jprante · December 30, 2015, 8:23pm

Not any more.

In 1.x, you could select the analyzer from a path. So, you could index the language code based on your input, and the analyzer would be automatically set to german, english, 中文, whatever.

In 2.x this feature was removed.
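
For reference, a sketch of what the 1.x-only setup looked like: the type mapping carried an `_analyzer` entry whose `path` named a document field, and the value of that field was used as the analyzer name when indexing the document. The helper names, the `analyzer_name` field, and the language-code table below are assumptions for illustration; the field has to hold an analyzer name (such as german), so a raw language code is translated first.

```python
# Assumed table of language codes -> built-in analyzer names.
ANALYZER_BY_LANGUAGE = {"de": "german", "en": "english", "zh": "cjk"}


def build_1x_mapping(doc_type="doc", analyzer_field="analyzer_name"):
    """Elasticsearch 1.x type mapping that picks the analyzer from a field.

    Only valid on 1.x; the _analyzer mapping does not exist in 2.x.
    """
    return {
        doc_type: {
            "_analyzer": {"path": analyzer_field},
            "properties": {
                "title": {"type": "string"},
                # Stored verbatim so the analyzer name is not itself analyzed.
                analyzer_field: {"type": "string", "index": "not_analyzed"},
            },
        }
    }


def prepare_document(doc, lang_field="language", analyzer_field="analyzer_name"):
    """Return a copy of doc with the analyzer name derived from its language."""
    out = dict(doc)
    out[analyzer_field] = ANALYZER_BY_LANGUAGE.get(out.get(lang_field), "standard")
    return out


if __name__ == "__main__":
    print(build_1x_mapping())
    print(prepare_document({"title": "Hallo Welt", "language": "de"}))
```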

Maybe I can find a trick to implement this again in my language detection plugin.
