I have a corpus of multilingual documents (around 25 languages), each with 3-4 translated fields (title, text, description, keywords). We are considering two approaches:
- a language-per-field approach, and
- a language-per-index approach.
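For concreteness, the two layouts could be sketched as follows. This is a hypothetical sketch assuming an Elasticsearch-style mapping; the field names, analyzer names, and index names are illustrative, not from any real deployment:

```python
fields = ("title", "text", "description", "keywords")
langs = ("en", "de", "fr")  # sample only; ~25 languages in practice

# 1) Language-per-field: a single index where every translated field is
#    expanded into one concrete field per language, each with its own analyzer.
per_field_mapping = {
    "properties": {
        f"{field}_{lang}": {"type": "text", "analyzer": lang}
        for field in fields
        for lang in langs
    }
}

# 2) Language-per-index: one index per language, with plain field names;
#    each index is configured with that language's analyzer.
per_index_mappings = {
    f"docs_{lang}": {
        "properties": {
            field: {"type": "text", "analyzer": lang} for field in fields
        }
    }
    for lang in langs
}

# With all 25 languages, the per-field layout yields 4 x 25 = 100 fields
# per document; here, with 3 sample languages, it is 4 x 3 = 12.
print(len(per_field_mapping["properties"]))
```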
Queries are not limited to a single language either: a user is typically interested in documents in a few languages at once.
I would like to understand the effect of a document having around 100 fields (25 languages x 4 fields):
- Does it cause memory overhead (field caches, etc.)?
- Does the mere presence of a large number of empty (null) fields have an impact on the index itself?
- Are the analyzed, tokenized terms stored at a per-field level?
If I were to go with the multi-index model (one index per language), what would be the impact of having 200+ million documents distributed over 25 indexes, where a few indexes hold the bulk (around 70%) of the documents?
Which would be the preferred model? The index-per-language model avoids having to craft queries across different per-language fields, and seems more easily manageable.
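To illustrate the query-crafting difference, here is a hypothetical sketch (again assuming Elasticsearch-style request bodies; all names are made up): in the per-field model a search spanning two languages must enumerate the per-language fields, while in the per-index model the same query body works unchanged and only the list of target indexes varies.

```python
user_langs = ["en", "de"]  # languages the user cares about

# Language-per-field: the field list must be expanded per language.
per_field_query = {
    "query": {
        "multi_match": {
            "query": "climate report",
            "fields": [
                f"{f}_{lang}" for f in ("title", "text") for lang in user_langs
            ],
        }
    }
}

# Language-per-index: one plain query body; language selection happens by
# choosing which indexes to search (e.g. "docs_en,docs_de").
target_indexes = ",".join(f"docs_{lang}" for lang in user_langs)
per_index_query = {
    "query": {
        "multi_match": {"query": "climate report", "fields": ["title", "text"]}
    }
}
```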
Can anyone throw some light on this?