Highlighting takes long time for large documents

On 7/18/2012 1:53 PM, Martijn van Groningen wrote:

Depends, but fast vector highlighting is around 2.5 times faster than
normal highlighting.

I just finished a small experiment related to highlighting and use of
FastVectorHighlighter.

1 Speed

I found that human-generated documents (as opposed to product
catalog entries, log entries, etc.)
were a lot faster than the quoted "2.5 times" (from the original
release notes), particularly for large files.
Like Jan, I had the occasional very large file. Any query that returned
two such ridiculous files and a few very large files would time out
before returning; now it might
take a total of 2 seconds! Queries are for a page of 10 results, each
result with one (best) hit highlight fragment.
The things that highlighted fast before were many times faster (like
5-10x) while the slow ones were still 2+ times faster!
I can't find any queries (1 page of 10 results) that take more than 400
ms to do everything (including a bunch of overhead in the UI).

2 Index size

Our indexes include the Tika parsed text. After adding offsets, two
example test indexes changed size as follows:
Before 27.4 MB -> after 34.1 MB (766 documents) = 24% increase
Before 941 MB -> after 1412 MB (9705 documents) = 50% increase (many
files occur twice in this corpus, so I believe the index overhead was
inherently smaller, and adding offsets was therefore a larger percentage).

Another index of 10,742 files (with fewer repeats than the 9705 file
index above), including offsets and the parsed text, resulted in an
on-disk size of 537 MB, which was about 40% of the size of all files (1.3
GB) and 198% of the total of all Tika parsed text (271 MB).
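
For anyone who wants to redo this arithmetic on their own indexes, here
is a minimal sketch in Python. The function names are only illustrative,
and treating 1.3 GB as 1300 MB is an assumption (the exact byte count is
not given above), so the "share of all files" figure is approximate.

```python
def percent_increase(before_mb, after_mb):
    """Growth of the index after adding offsets, as a percentage."""
    return (after_mb - before_mb) / before_mb * 100.0


def percent_of(part_mb, whole_mb):
    """Size of one thing expressed as a percentage of another."""
    return part_mb / whole_mb * 100.0


# Sizes in MB, taken from the figures quoted above.
small_index = percent_increase(27.4, 34.1)   # 766 documents
large_index = percent_increase(941, 1412)    # 9705 documents
index_vs_text = percent_of(537, 271)         # index vs. Tika parsed text
index_vs_files = percent_of(537, 1300)       # assumes 1.3 GB = 1300 MB

print(f"small index grew by {small_index:.1f}%")
print(f"large index grew by {large_index:.1f}%")
print(f"index is {index_vs_text:.0f}% of the parsed text")
print(f"index is roughly {index_vs_files:.0f}% of the original files")
```

Measuring the index directory before and after a reindex with offsets
enabled and feeding the two sizes to percent_increase gives a
comparable number for your own data.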

Note, I don't consider these large examples.

Conclusions

An index with parsed text including term vectors with positions and
offsets is roughly half text and half indexing information.
Since half our index is the parsed text, the percentage increase in size
when adding offsets would be larger (about 2x) for indexes without the
parsed text.
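
The reasoning behind that x2 can be sketched in a few lines of Python.
This is only a back-of-the-envelope estimate: it assumes the offsets add
the same number of bytes whether or not the parsed text is stored, which
I have not measured. The function name and parameters are illustrative.

```python
def increase_without_text(increase_with_text_pct, text_fraction):
    """Estimate the percentage growth from adding offsets for an index
    that does NOT store the parsed text.

    increase_with_text_pct: growth measured on an index that stores text
    text_fraction: share of that index taken up by the stored text (0..1)

    Assumption: offsets cost the same absolute number of bytes either
    way, so the same cost is spread over a smaller base.
    """
    if not 0.0 <= text_fraction < 1.0:
        raise ValueError("text_fraction must be in [0, 1)")
    return increase_with_text_pct / (1.0 - text_fraction)


# With the text making up about half the index, the measured 24-50%
# range roughly doubles.
for measured in (25, 50):
    estimate = increase_without_text(measured, 0.5)
    print(f"{measured}% with text -> about {estimate:.0f}% without text")
```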

Adding offsets to an index with parsed text can add 25-50% to the size
of an index and can REALLY HELP highlight large files, often with
a _MINIMUM_ improvement of approximately 2-2.5x.
I think the folks at Lucene were being very conservative in their
estimates of speed increases.

YMMV,

-Paul