Best practices for multi-language search and indexing

Hi, I'm new :)

I have the following scenario: a document with a nested field that contains one or more "blocks", each with a text field. That text field can be in one of four different languages (I know the language of each block). Different "blocks" of the same document can be in different languages.

Which is the better approach for my index?

A) Always fill the field 'content' and all of its sub-fields

 	"content":{
		"type": "text",
		"fields": {
			"es": {
				"type": "text",
				"analyzer": "rebuilt_spanish"
			},
			"en": {
				"type": "text",
				"analyzer": "rebuilt_english"
			},
			(...)
		}
	}

B) Pre-process the data before indexing and fill only the field that matches each block's language, leaving the others empty:

	"contentES":{
		"type": "text",
		"fields": {
			"es": { 
				"type": "text",
				"analyzer": "rebuilt_spanish"
			}
		}
	},
	"contentEN":{
		"type": "text",
		"fields": {
			"en": { 
				"type": "text",
				"analyzer": "rebuilt_english"
			}
		}
	}
	(...)
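To make option B concrete, here is a rough sketch of the pre-processing step in Python. It only covers the two fields shown above (the remaining languages would be added the same way), and the nested field name `blocks` plus the input keys `text` and `lang` are assumptions for illustration, not names from my real index.

```python
from typing import Dict, List

# Maps a language code to the per-language field from option B.
# Only the two fields shown in the mapping are listed here; the
# remaining languages would be added in the same way.
LANGUAGE_FIELDS: Dict[str, str] = {
    "es": "contentES",
    "en": "contentEN",
}


def route_block(text: str, lang: str) -> Dict[str, str]:
    """Return a block that fills only the field matching its language."""
    if lang not in LANGUAGE_FIELDS:
        raise ValueError(f"unsupported language: {lang!r}")
    return {LANGUAGE_FIELDS[lang]: text}


def build_document(blocks: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Build the document body; "blocks" is a hypothetical nested field name."""
    return {"blocks": [route_block(b["text"], b["lang"]) for b in blocks]}


doc = build_document([
    {"text": "Hola mundo", "lang": "es"},
    {"text": "Hello world", "lang": "en"},
])
```

Each block ends up with exactly one populated field, so the other per-language fields are simply absent rather than indexed empty.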

Would option A use too much memory and disk space?
It is easier for me to produce the data for indexing in the first case, since I can ignore the language.
In both cases, I will run a multi-field search.
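For reference, this is roughly the search I have in mind, sketched as a Python function that builds the request body. It assumes a `multi_match` query wrapped in a `nested` query; the nested path `blocks` is a placeholder name, and `most_fields` is my guess at a suitable type (the default would be `best_fields`).

```python
from typing import Any, Dict, List


def multi_language_query(text: str, fields: List[str], path: str = "blocks") -> Dict[str, Any]:
    """Build a nested multi_match body over the given per-language fields.

    `path` is a hypothetical name for the nested field; fields inside a
    nested query are referenced by their full path, so it is prefixed here.
    """
    full_fields = [f"{path}.{field}" for field in fields]
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "multi_match": {
                        "query": text,
                        "fields": full_fields,
                        # Assumption: combine scores from all matching fields.
                        "type": "most_fields",
                    }
                },
            }
        }
    }


# Option A: one field with a sub-field per language.
option_a = multi_language_query("casa", ["content", "content.es", "content.en"])

# Option B: one top-level field per language, each with its analyzed sub-field.
option_b = multi_language_query("casa", ["contentES.es", "contentEN.en"])
```

The query has the same shape for both options; only the list of fields changes.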

What do you think?
Thank you!!

The Definitive Guide seems to suggest option A for you, but it's worth reading that whole chapter to evaluate the trade-offs of the other approaches to dealing with human language. As the doc says, mixed language fields "are the most difficult type of multilingual document to handle correctly".

