For large texts, indexing with offsets or term vectors is recommended

cas4 · March 3, 2021, 1:42pm

I have an index that contains the text of files. Most files aren't that large, but some do have more than 1,000,000 characters. I'm using the default unified highlighter when displaying results to the end users. When a search term that returns one of the large files is executed, I'm getting the following error:

The length of [fileText.stemmed] field of [6] doc of [attachments] index has exceeded [1000000] - maximum allowed to be analyzed for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended! (illegal_argument_exception)

I would prefer not to increase the index.highlight.max_analyzed_offset setting, because it seems like it would hurt performance and I would have to know the maximum size of the documents in my index, which is constantly changing. The error message indicates that "indexing with offsets or term vectors is recommended", but I cannot find any documentation on how to implement this. Any pointers on how to handle this would be greatly appreciated.

Thanks.

Mark_Harwood · March 3, 2021, 2:44pm

This situation wasn't handled particularly well in Elasticsearch: the whole search request is thrown out because one doc was large.
Coming in 7.12 is a new query flag which simply truncates the amount of doc text we try to highlight, rather than requiring you to reindex or raise the limit set on the index. It's a better overall solution.
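
A minimal sketch of what such a request body might look like. The thread does not name the flag; this assumes it is the `max_analyzed_offset` highlight option available from 7.12 onwards. The query text, the helper name `build_highlight_query` and the chosen limit (kept just below the index-level 1,000,000) are placeholders. The body is only built and printed here, not sent to a cluster.

```python
import json


def build_highlight_query(field, text, max_analyzed_offset=999_999):
    """Build a search body whose highlighter stops analyzing at a fixed offset."""
    return {
        "query": {"match": {field: text}},
        "highlight": {
            # Assumed option name (7.12+): text beyond this offset is simply
            # not highlighted instead of failing the whole request.
            "max_analyzed_offset": max_analyzed_offset,
            "fields": {field: {}},
        },
    }


# Field name taken from the error message in the question.
body = build_highlight_query("fileText.stemmed", "example search term")
print(json.dumps(body, indent=2))
```

The resulting JSON would be used as the body of a `_search` request against the index.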

As a workaround, one approach I often advocate is to take large documents (e.g. books) and index them as multiple smaller docs (e.g. one per chapter).
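
A rough sketch of that splitting step, assuming plain text with no chapter markers, so it cuts at paragraph breaks instead. The helper names, the `file_id` and `chunk` fields and the 100,000-character limit are all hypothetical; only `fileText` comes from the question.

```python
def split_into_chunks(text, max_chars=100_000):
    """Split text into pieces of at most max_chars, preferring paragraph breaks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last blank line inside the window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2
        chunks.append(text[start:end])
        start = end
    return chunks


def to_chunk_docs(file_id, text, max_chars=100_000):
    """Turn one large file into several small documents ready for indexing."""
    return [
        # file_id and chunk are hypothetical fields for regrouping hits per file.
        {"file_id": file_id, "chunk": i, "fileText": chunk}
        for i, chunk in enumerate(split_into_chunks(text, max_chars))
    ]
```

Each chunk stays well under the highlighting limit, and hits can be grouped back to the original file through the shared `file_id`.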

cas4 · March 3, 2021, 4:47pm

Thanks for the suggestions. We're on 7.11.1, so we will migrate up to 7.12. In addition, I was able to update the problematic field's definition with term_vector: with_positions_offsets, and that seems to have done the trick. If performance becomes an issue, I'll probably break up the documents into smaller string values as you suggested.
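
For anyone landing here later, a sketch of what such a field definition might look like. The actual mapping is not shown in this thread: the multi-field layout is inferred from the name fileText.stemmed, and the `english` analyzer is an assumption. Documents indexed before the change would presumably need to be reindexed to pick up term vectors, and storing them increases index size.

```python
import json


def stemmed_subfield(analyzer="english"):
    """Sub-field definition that stores term vectors for highlighting."""
    return {
        "type": "text",
        "analyzer": analyzer,  # assumed; the thread does not name the analyzer
        "term_vector": "with_positions_offsets",
        # Alternative from the error message ("indexing with offsets"):
        # "index_options": "offsets"
    }


mapping = {
    "mappings": {
        "properties": {
            "fileText": {
                "type": "text",
                "fields": {"stemmed": stemmed_subfield()},
            }
        }
    }
}
print(json.dumps(mapping, indent=2))
```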

system · March 31, 2021, 4:47pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
