Highlight not working for multi-fields if each subfield is not stored, "main subfield" stored, and _source disabled

ofavre · July 5, 2011, 1:09pm

I guess the title sums up the problem:
Highlight not working for multi-fields if each subfield is not stored, "main subfield" stored, and _source disabled

Here is a curl recreation:

https://gist.github.com/ofavre/1064746

recreation.sh

set -v
# Delete any previous index
curl -XDELETE 'localhost:9200/bugindex'


# Create an index
curl -XPUT 'localhost:9200/bugindex' -d '{
	"settings" : {
		"index" : {
			"number_of_shards" : 1,

(This file has been truncated; the full script is in the gist.)

A little explanation of why we do not store _source:
We have many fields (around 200), and each document's source is around 5k.
We use SSDs in production, so the index size must remain reasonable, while disk seeks are not really a problem.
We often want to return only a subset of the fields with the results.
We cannot afford to be too slow.
Therefore, we had to choose between enabling _source and storing each field.
To avoid wasting time loading and parsing the source, we chose the latter.
(Again, we know that this implies more I/O seeks, but that is fine with SSDs.)
→ I hope we're not missing a reasoning step and that our choice is sound.

So we do not have the _source.
With multi-fields, storing each subfield's value seems completely useless, as it is identical to the value of the "main" subfield/property.
It also seems useless to index the "main" subfield, since the point of subfields is to provide different analyses and separate field indices.
Therefore, we store "multi.multi" and neither "multi.exact" nor "multi.english", and we index the latter two and not the former.
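
To make this layout concrete, here is a minimal sketch of what such an index definition could look like. It is not the truncated gist: the type name "doc" and the "standard" and "snowball" analyzers are assumptions chosen for illustration, and the request is only sent when the function is called.

```shell
# Sketch only: the type name ("doc") and analyzer choices are assumptions,
# not taken from the original recreation script.
INDEX_BODY='{
  "mappings": {
    "doc": {
      "_source": { "enabled": false },
      "properties": {
        "multi": {
          "type": "multi_field",
          "fields": {
            "multi":   { "type": "string", "store": "yes", "index": "no" },
            "exact":   { "type": "string", "store": "no",  "index": "analyzed", "analyzer": "standard" },
            "english": { "type": "string", "store": "no",  "index": "analyzed", "analyzer": "snowball" }
          }
        }
      }
    }
  }
}'

# Create the index with the mapping above (needs a node on localhost:9200).
create_bugindex() {
  curl -XPUT 'localhost:9200/bugindex' -d "$INDEX_BODY"
}
```

Calling create_bugindex against a local node would create the index; only "multi.multi" carries a stored value, and only the two other subfields are searchable.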

The problem is that highlighting runs for term queries against specific fields, namely "multi.exact" or "multi.english", and needs access to their values, which can only be retrieved through the stored "multi" field, which is "not the same field".
The problem lies in this last quoted expression.
Multi-fields' subfields should look up the value of the main subfield/property.
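
A request of the following shape illustrates the mismatch. This is a hedged sketch rather than the query from the gist: the search term "test" is made up, and the request is only sent when the function is called. The query and the highlight both name "multi.english", while the only stored value lives under "multi".

```shell
# Sketch only: the search term is a made-up example.
SEARCH_BODY='{
  "query": { "term": { "multi.english": "test" } },
  "fields": ["multi"],
  "highlight": { "fields": { "multi.english": {} } }
}'

# Run the search (needs a node on localhost:9200 and the bugindex index).
search_with_highlight() {
  curl -XGET 'localhost:9200/bugindex/_search?pretty=true' -d "$SEARCH_BODY"
}
```

With the setup described above, hits would be expected to come back with the stored "multi" value but without a highlight section for "multi.english".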

However, we think that a more promising approach would be to multiplex the analyzed terms of each subfield into the index of the main field/property. (This could be done manually by creating a special analyzer, but we will handle many languages, so it's not a beautiful solution, and it would be less beneficial.)

Implementing this solution may also be beneficial for:

Textual search where you may wish to index exact, normalized and stemmed terms in the same field index.
Grouping analyzed terms from multiple fields into a kind of "_all_terms" field, which would combine differentiated analyzers with the speed of searching against a single field index.

By the way, the final goal is to perform a good search against documents using different languages (one per doc), be robust to language misdetection (as the detected language controls the analyzer used for the query), and favor exact matches over matches through stemmed words.
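
As a rough illustration of favoring exact matches, one option with the existing subfields is a query_string query that searches both and boosts the exact one. This is only a sketch under assumptions: the boost of 2 and the search text "running" are arbitrary, and it does not address the language-detection part of the goal.

```shell
# Sketch only: boost value and search text are arbitrary examples.
RANKED_BODY='{
  "query": {
    "query_string": {
      "fields": ["multi.exact^2", "multi.english"],
      "query": "running"
    }
  }
}'

# Run the search (needs a node on localhost:9200 and the bugindex index).
search_ranked() {
  curl -XGET 'localhost:9200/bugindex/_search?pretty=true' -d "$RANKED_BODY"
}
```

A document matching on "multi.exact" would then tend to score above one that matches only through the stemmed "multi.english" terms.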
