Annotated text plugin and index_options/term_vectors

I'm using the annotated_text field to index documents, but for larger documents I'm getting errors:

index has exceeded [1000000] - maximum allowed to be analyzed for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!

I've tried adding "index_options": "offsets" and "term_vector": "with_positions_offsets", but neither seems to have any effect — the mapping comes back without them on the annotated_text fields:

PUT test_index
{
    "mappings": {
        "properties": {
            "annotated_with_offsets": {
                "type": "annotated_text",
                "index_options": "offsets"
            },
            "annotated_with_term_vector": {
                "type": "annotated_text",
                "term_vector": "with_positions_offsets"
            },
            "text_with_offsets": {
                "type": "text",
                "index_options": "offsets"
            },
            "text_with_term_vector": {
                "type": "text",
                "term_vector": "with_positions_offsets"
            }
        }
    }
}
GET /test_index/_mapping

{
    "test_index": {
        "mappings": {
            "properties": {
                "annotated_with_offsets": {
                    "type": "annotated_text"
                },
                "annotated_with_term_vector": {
                    "type": "annotated_text"
                },
                "text_with_offsets": {
                    "type": "text",
                    "index_options": "offsets"
                },
                "text_with_term_vector": {
                    "type": "text",
                    "term_vector": "with_positions_offsets"
                }
            }
        }
    }
}

This error is a safeguard that kicks in to avoid servers being overloaded by expensive highlighting tasks.
There is some work underway to stop these errors from failing searches and instead just curtail the highlighting work on individual long docs.

The annotated_text field has a custom highlighter that does some special tricks to deal with annotations, which sadly means it doesn't use term vectors, so those settings won't help here. Some potential workarounds:

  • You could try increasing the index.highlight.max_analyzed_offset setting (I ended up doing that to look at Wikipedia content).
  • You could add a must_not clause on the size of the doc to your query, so that super-large documents are excluded and never hit this error.
  • You could break up large documents into smaller ones, e.g. instead of indexing whole books, index chapters.
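For the first option, index.highlight.max_analyzed_offset is a dynamic index-level setting, so it can be raised on an existing index with a settings update. Something like this should work (the value here is only an illustration — larger values trade highlighting memory and CPU for coverage of longer documents, so pick one that fits your data):

PUT test_index/_settings
{
    "index.highlight.max_analyzed_offset": 10000000
}

The change takes effect immediately, with no reindex required, since the setting only controls how much text the highlighter is allowed to re-analyze at search time.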