Highlighting takes long time for large documents

phill · July 19, 2012, 12:15am

On 7/18/2012 1:53 PM, Martijn van Groningen wrote:

Depends, but fast vector highlighting is around 2.5 times faster than
normal highlighting.

I just finished a small experiment related to highlighting and use of
FastVectorHighlighter.

1 Speed

I found that human-generated documents (as opposed to product
catalog entries, log entries, etc.)
were a lot faster than the quoted "2.5 times" (from the original
release notes), particularly for large files.
Like Jan, I had the occasional very large file. Any query that returned
two such ridiculous files and a few very large files would time out
before returning, or might
take a total of 2 seconds! Queries are for a page of 10 results, each
result with one (best) hit highlight fragment.
The things that highlighted fast before were many times faster (like
5-10x) while the slow ones were still 2+ times faster!
I can't find any queries (1 page of 10 results) that take more than 400
ms to do everything (including a bunch of overhead in the UI).
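
For anyone wanting to try the same setup, here is a minimal sketch of
what the mapping and search request might look like, written as Python
dictionaries. The field name "content" and the query text are
placeholders, not taken from our index, and the exact syntax depends on
the Elasticsearch version (older releases use "string" rather than
"text" as the field type, and may pick the fast vector highlighter
automatically once term vectors with positions and offsets are present).

```python
import json

# Hypothetical field "content" holding the Tika parsed text.
# term_vector "with_positions_offsets" is what the fast vector
# highlighter needs in order to avoid re-analyzing the text.
mapping = {
    "properties": {
        "content": {
            "type": "text",
            "term_vector": "with_positions_offsets",
        }
    }
}

# One page of 10 results, each with a single (best) highlight fragment.
search_body = {
    "size": 10,
    "query": {"match": {"content": "example terms"}},
    "highlight": {
        "fields": {
            "content": {
                "type": "fvh",
                "number_of_fragments": 1,
            }
        }
    },
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

The two dictionaries would be sent as the mapping and the search
request body respectively; changing term_vector on an existing field
generally means reindexing.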

2 Index size

Our indexes include the Tika parsed text. After adding offsets, two
example test indexes changed size as follows.
Before 27.4 MB -> after 34.1 MB (766 documents) = 24% increase
Before 941 MB -> after 1412 MB (9705 documents) = 50% increase (many
files occur twice in this corpus, so I believe the index overhead was
inherently smaller and thus adding offsets was a larger percentage).

Another index of 10,742 files (with fewer repeats than the 9705 file
index above) including offsets and the parsed text resulted in an
on-disk size of 537 MB, which was about 40% of the size of all files (1.3
GB) and 198% of the total of all Tika parsed text (271 MB).
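
The percentages above can be rechecked with a few lines of Python. This
is only arithmetic on the sizes quoted in this post; the helper name is
made up, and 1.3 GB is taken as 1300 MB.

```python
def percent_increase(before_mb, after_mb):
    """Growth from before_mb to after_mb, as a percentage of before_mb."""
    return (after_mb - before_mb) / before_mb * 100.0


# Index sizes before and after adding offsets (MB).
small_index = percent_increase(27.4, 34.1)   # 766 documents
large_index = percent_increase(941, 1412)    # 9705 documents

# Third index (10,742 files): on-disk size relative to the inputs.
index_mb = 537
share_of_files = index_mb / 1300 * 100       # all source files, 1.3 GB
share_of_text = index_mb / 271 * 100         # Tika parsed text only

print(f"small index growth: {small_index:.1f}%")
print(f"large index growth: {large_index:.1f}%")
print(f"index vs. all files: {share_of_files:.1f}%")
print(f"index vs. parsed text: {share_of_text:.1f}%")
```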

Note, I don't consider these large examples.

Conclusions

An index with parsed text including term vectors with positions and
offsets is roughly half text and half indexing information.
Since half our index is the parsed text, the percentage increase in size
when adding offsets would be about twice as large (x2) for indexes
without the parsed text.
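
The "x2" reasoning can be illustrated with made-up round numbers. The
sketch assumes that offsets cost the same number of megabytes whether or
not the parsed text is kept in the index, and it takes the "half text"
estimate at face value; the sizes below are hypothetical, not
measurements.

```python
def relative_increase(base_mb, offsets_mb):
    """Size added by offsets as a fraction of the index without them."""
    return offsets_mb / base_mb


index_with_text_mb = 100.0                # hypothetical index incl. text
parsed_text_mb = index_with_text_mb / 2   # "half text" assumption
offsets_mb = 25.0                         # hypothetical cost of offsets

with_text = relative_increase(index_with_text_mb, offsets_mb)
without_text = relative_increase(index_with_text_mb - parsed_text_mb,
                                 offsets_mb)

print(f"with parsed text:    +{with_text:.0%}")
print(f"without parsed text: +{without_text:.0%}")
```

The same absolute overhead is divided by a base half the size, so the
percentage doubles.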

Adding offsets to an index with parsed text can add 25-50% to the size
of an index and can REALLY HELP highlight large files, often with
a _MINIMUM_ of approximately 2-2.5x improvement.
I think the folks at Lucene were being very conservative in their
estimates of speed increases.

YMMV,

-Paul
