For large texts, indexing with offsets or term vectors is recommended

I have an index that contains the text of files. Most files aren't that large, but some have more than 1,000,000 characters. I'm using the default unified highlighter when displaying results to end users. When a search matches one of the large files, I get the following error:

The length of [fileText.stemmed] field of [6] doc of [attachments] index has exceeded [1000000] - maximum allowed to be analyzed for highlighting. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended! (illegal_argument_exception)

I would prefer not to increase the index.highlight.max_analyzed_offset setting, because it seems like it would hurt performance and I would have to know the maximum size of the documents in my index, which is constantly changing. The error message says that "indexing with offsets or term vectors is recommended", but I cannot find any documentation on how to implement this. Any pointers on how to handle this would be greatly appreciated.

Thanks.

This situation wasn't handled particularly well in Elasticsearch: a whole search request gets thrown out because one doc was large.
Coming in 7.12 is a new query-level option that simply truncates the document text we try to highlight, rather than requiring you to reindex or raise the limit set on the index. It's a better overall solution.
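
For anyone landing here later, here is a rough sketch of what such a request body might look like, assuming the option is the `max_analyzed_offset` parameter in the `highlight` section of the search request. The helper name `build_search_body`, the query text, and the value 999,999 are placeholders for illustration; the value is kept below the index-level limit of 1,000,000 from the error above.

```python
import json


def build_search_body(query_text, field="fileText.stemmed", max_offset=999_999):
    """Build a search body whose highlighter stops analyzing after max_offset characters.

    Hypothetical helper: the field name comes from the error message in this
    thread, while max_offset is an illustrative value, not a recommendation.
    """
    return {
        "query": {"match": {field: query_text}},
        "highlight": {
            # Query-level limit; text beyond this offset is simply not highlighted.
            "max_analyzed_offset": max_offset,
            "fields": {field: {}},
        },
    }


body = build_search_body("invoice")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request against the index. Matches that occur past the offset would still be returned as hits, just without highlight fragments for that part of the text.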

As a workaround, one approach I often advocate is to take large documents (e.g. books) and index them as multiple smaller docs (e.g. one per chapter).
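
A minimal sketch of that splitting step, assuming plain text with no chapter markers to split on, so it falls back to fixed-size chunks broken at whitespace. The function name `split_into_parts`, the `parent_id` and `part` fields, and the 100,000-character chunk size are all made up for illustration.

```python
def split_into_parts(doc_id, text, max_chars=100_000):
    """Split one large text into smaller documents of at most max_chars characters.

    Each part keeps a reference to the original document so that hits can be
    grouped back together at search time.
    """
    parts = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break at the last space in the window so words stay whole.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        parts.append(
            {"parent_id": doc_id, "part": len(parts), "fileText": text[start:end]}
        )
        start = end
    return parts


parts = split_into_parts("6", "lorem ipsum dolor sit amet " * 20, max_chars=50)
print(len(parts), "parts; longest is", max(len(p["fileText"]) for p in parts), "chars")
```

Each resulting dict would be indexed as its own document. Keeping every part well under the highlighting limit means no single doc can trigger the error, at the cost of having to collapse or group hits by `parent_id` when showing results.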

Thanks for the suggestions. We're on 7.11.1, so we will upgrade to 7.12. In addition, I was able to update the problematic field's definition with term_vector: with_positions_offsets, and that seems to have done the trick. If performance becomes an issue, I'll probably break up the documents into smaller string values as you suggested.
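
For reference, a field definition along those lines might look roughly like the sketch below. Only the field name `fileText.stemmed` and the `term_vector` value come from this thread; the multi-field layout, the `english` analyzer, and the helper name `build_mapping` are assumptions.

```python
import json


def build_mapping(term_vector="with_positions_offsets"):
    """Build an index mapping whose stemmed sub-field stores term vectors.

    Assumed layout: fileText is a text field with a 'stemmed' sub-field.
    """
    return {
        "mappings": {
            "properties": {
                "fileText": {
                    "type": "text",
                    "fields": {
                        "stemmed": {
                            "type": "text",
                            "analyzer": "english",  # assumption, not from the thread
                            # Stores positions and offsets so highlighting
                            # does not have to re-analyze the full text.
                            "term_vector": term_vector,
                        }
                    },
                }
            }
        }
    }


mapping = build_mapping()
print(json.dumps(mapping, indent=2))
```

The other option the error message mentions, indexing with offsets, corresponds to setting `index_options` to `offsets` on the field instead. Term vectors make the index larger, so it is worth checking disk usage after the change.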
