I guess the title sums up the problem:
Highlighting does not work for multi-fields when the subfields are not stored, the "main" subfield is stored, and _source is disabled
Here is a curl recreation:
A little explanation why we do not store _source:
We have many fields (around 200), and each document's source is around 5k.
We use SSDs in production, so the index size has to stay reasonable, while disk seeks are not really a problem.
We often want to return a subset of the fields when getting results back.
We cannot afford to be too slow.
Therefore, we had to choose between either enabling _source, or storing each field.
To avoid wasting time loading and parsing the source, we chose the latter.
(Again, we know that this implies more IO seeks, but that is fine with SSDs.)
→ I hope we are not missing a step in this reasoning and that our choice is sound.
So we do not have the _source.
With multi-fields, it seems pointless to store each subfield's value, since it is identical to the value of the "main" subfield/property.
It also seems pointless to index the "main" subfield, since the whole point of subfields is to provide differently analyzed field indexes.
Therefore, we store "multi.multi" but neither "multi.exact" nor "multi.english", and we index the latter two but not the former.
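For illustration, the mapping described above would look roughly like the sketch below. It assumes the legacy "multi_field" mapping type with the yes/no string values for "store" and "index" used by older Elasticsearch releases. The type name "doc" and the "standard" analyzer on "multi.exact" are placeholders, not our real settings.

```python
import json

# Sketch of the mapping described above, using the legacy "multi_field" type.
# "doc" and the analyzer chosen for "exact" are placeholders (assumptions).
mapping = {
    "doc": {
        "_source": {"enabled": False},
        "properties": {
            "multi": {
                "type": "multi_field",
                "fields": {
                    # "main" subfield: stored only, so its value can be returned
                    "multi": {"type": "string", "index": "no", "store": "yes"},
                    # analysis-only subfields: indexed, but not stored
                    "exact": {
                        "type": "string",
                        "index": "analyzed",
                        "analyzer": "standard",
                        "store": "no",
                    },
                    "english": {
                        "type": "string",
                        "index": "analyzed",
                        "analyzer": "english",
                        "store": "no",
                    },
                },
            }
        },
    }
}

print(json.dumps(mapping, indent=2))
```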
The problem is that highlighting runs for term queries against specific fields, namely "multi.exact" or "multi.english", and needs access to their values, which can only be retrieved through the stored "multi" field, which is "not the same field".
The problem lies in this last quoted expression.
The subfields of a multi-field should look up the value of the main subfield/property.
However, we think that a more promising approach would be to multiplex the analyzed terms of each subfield into the field index of the main field/property. (This could be done manually by creating a special analyzer, but we will handle many languages, so it's not an elegant solution, and it would be less beneficial.)
Implementing this solution may also be beneficial for:
- Textual search where you may wish to index exact, normalized, and stemmed terms in the same field index.
- Grouping analyzed terms from multiple fields into a kind of "_all_terms" field, which would combine differentiated analyzers with the speed of searching against a single field index.
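To make the multiplexing idea concrete, here is a toy sketch in plain Python. It is not Lucene or Elasticsearch code: the two "analyzers" are hypothetical stand-ins, and the sketch assumes every analyzer emits the same number of tokens for a given text. It only shows terms from several analyses ending up in one index, with variants of the same word sharing a position.

```python
def exact_terms(text):
    """Toy 'exact' analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def stemmed_terms(text):
    """Toy 'stemmed' analyzer: strip a trailing 's' (stand-in for a real stemmer)."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t
            for t in exact_terms(text)]


def multiplex(text, analyzers):
    """Merge the output of several analyzers into one list of (position, term).

    Variants of the same token share a position; duplicates are dropped.
    Assumes all analyzers produce the same number of tokens.
    """
    streams = [analyze(text) for analyze in analyzers]
    merged = []
    for position, variants in enumerate(zip(*streams)):
        seen = []
        for term in variants:
            if term not in seen:
                seen.append(term)
                merged.append((position, term))
    return merged


print(multiplex("Running dogs", [exact_terms, stemmed_terms]))
# [(0, 'running'), (1, 'dogs'), (1, 'dog')]
```

With terms laid out like this, a query for either "dogs" or "dog" hits the same position in the same field, so highlighting only ever needs the one stored value.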
By the way, the final goal is to perform a good search against documents in different languages (one per document), be robust to language misdetection (since the detected language controls the analyzer used for the query), and favor exact matches over matches through stemmed words.
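As a rough idea of what "favor exact match" means on the query side, the sketch below builds a request body that searches both subfields and boosts the exact one. It assumes the standard "bool" and "match" queries. The helper name build_query and the boost value 2.0 are illustrative choices, not something we have tuned.

```python
import json


def build_query(text, exact_boost=2.0):
    """Search both subfields; a hit on the exact subfield scores higher."""
    return {
        "query": {
            "bool": {
                "should": [
                    # exact terms: boosted so that exact matches rank first
                    {"match": {"multi.exact": {"query": text, "boost": exact_boost}}},
                    # stemmed terms: still match, but with the default weight
                    {"match": {"multi.english": {"query": text}}},
                ]
            }
        }
    }


print(json.dumps(build_query("running dogs"), indent=2))
```

Because the exact clause does not depend on the detected language, it still matches when the language-specific analyzer was picked wrongly.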