Effect of field cardinality (stored=false) when varying text analysis per language

I have a corpus of multilingual documents (around 25 languages), each with 3-4 multilingual fields (title, text, description, keywords). We are considering two approaches:

  1. a language-per-field approach and
  2. a language-per-index approach
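To make the two layouts concrete, here is a minimal sketch assuming a Solr-style schema; the engine is not named above, and all field and type names here are hypothetical:

```xml
<!-- Option 1: language per field — one analyzed variant of each field per
     language, all in a single index; most of these fields are empty (null)
     on any given document -->
<field name="title_en" type="text_en" indexed="true" stored="false"/>
<field name="title_de" type="text_de" indexed="true" stored="false"/>
<field name="text_en"  type="text_en" indexed="true" stored="false"/>
<field name="text_de"  type="text_de" indexed="true" stored="false"/>
<!-- ... repeated for ~25 languages x 3-4 fields = ~100 fields -->

<!-- Option 2: language per index — each index keeps the plain field names
     and binds them to that language's analyzer in its own schema
     (this example would live in the German index's schema) -->
<field name="title" type="text_de" indexed="true" stored="false"/>
<field name="text"  type="text_de" indexed="true" stored="false"/>
```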

The querying scenario also does not always involve a single language, as the person querying is often interested in documents in a few languages at once.
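For the multi-language query case, the difference between the two models might look like the following sketch, assuming Solr's eDisMax parser and distributed search; the field, host, and index names are hypothetical:

```text
# Language per field: one index, and the query must expand over
# every per-language field the user cares about
q=climate change
defType=edismax
qf=title_en text_en title_fr text_fr

# Language per index: plain field names, but the request fans out
# to the relevant per-language indexes instead
q=climate change
defType=edismax
qf=title text
shards=host:8983/solr/idx_en,host:8983/solr/idx_fr
```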

I would like to know the effect of having around 100 fields (25 languages x 3-4 fields) added to each document.

  1. Does it cause any memory overhead (field caches, etc.)?
  2. Is there an impact on the index itself from the mere presence of a large number of null (empty) fields?
  3. Are the analyzed, tokenized terms stored on a per-field level?

If I were to go with the multi-index model (language per index), what would be the impact of having 200 million+ documents distributed over 25 indexes, where a few of the indexes hold the bulk of the documents (around 70%)?

Which would be the preferred model? The index-per-language model avoids having to craft queries across different fields for different languages, and seems easier to manage.

Can anyone throw some light on this?