Highlighting takes a long time for large documents

Hi

My index contains objects of various sizes, usually around 1-100 kB.
However, sometimes there are documents of 10-30 MB. When I get search hits
for these objects, highlighting takes a very long time.

Highlighting (large object included): 263257 ms
Highlighting (large object excluded): 32 ms
Without highlighting (large object included): 328 ms

The large object is around 25 MB.
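
For reference, the searches are roughly of this form (the index and field
names below are only placeholders); the timings above are from the same
kind of query, run with and without the highlight section and with the
large object included or excluded from the hits:

curl -XPOST 'http://localhost:9200/myindex/_search?pretty=true' -d '{
  "query": { "query_string": { "query": "some search terms" } },
  "highlight": { "fields": { "content": {} } }
}'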

Is this the expected behavior?
Will providing term_vector information by setting it to
"with_positions_offsets" help with this problem?
Is there any other tuning I can perform?

I'm running a single node with 0.19.4 on Ubuntu 12.04 on an EC2 XL
instance. The index is currently stored on EBS. I have increased the memory
for Elasticsearch to 8 GB; however, the allocated heap is usually around
1.5 GB.
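
(For reference, the heap was raised through the environment the node is
started from, roughly like the lines below; the exact variable names depend
on the startup script in use.)

export ES_MIN_MEM=8g
export ES_MAX_MEM=8g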

I have several indices totaling 40 GB. The particular index where this
happens is 5 GB with 315,000 documents, although I have similar problems
with the other indices.

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.3) (6b24-1.11.3-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

/Jan

Hi Jan,

Highlighting large pieces of text is slow. Setting 'term_vector' to
'with_positions_offsets' will increase the highlighting speed (under the
hood this triggers Lucene's fast vector highlighter). However, it will also
increase the size of your index dramatically. How much exactly is hard to
say, but I wouldn't be surprised if your index doubled in size; it depends
on how many fields you set term_vector to with_positions_offsets for and on
your input data.
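
For example, a mapping along these lines enables the fast vector
highlighter for one field (the index, type, and field names below are just
placeholders). Documents indexed before the mapping change have no term
vectors, so existing data needs to be reindexed into an index created with
such a mapping before highlighting gets faster:

curl -XPUT 'http://localhost:9200/myindex' -d '{
  "mappings": {
    "mytype": {
      "properties": {
        "content": {
          "type": "string",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}'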

Martijn

How much faster will the results be with with_positions_offsets? From 260
seconds to... a few seconds? Or even faster?

Any other design suggestions? For example, since I'm only highlighting a
single field, I'm considering saving a truncated version of that field that
only contains the first 100 kB (or something). That would work in most
cases, and I'd have to live with possibly missing highlighting for the
really large documents.
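
Concretely, something like this is what I have in mind (the index, type,
and field names are made up): the application indexes the full text plus a
truncated copy, queries run against the full field, and highlighting is
only requested on the truncated copy.

# index time: store the full text and a ~100 kB truncated copy
curl -XPUT 'http://localhost:9200/myindex/mytype/1' -d '{
  "content": "... full text, possibly tens of MB ...",
  "content_short": "... first 100 kB of the same text ..."
}'

# query time: search the full field, highlight only the truncated copy
curl -XPOST 'http://localhost:9200/myindex/_search?pretty=true' -d '{
  "query": {
    "query_string": { "default_field": "content", "query": "some search terms" }
  },
  "highlight": { "fields": { "content_short": {} } }
}'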

/Jan


On 18 July 2012 19:51, Jan Kronquist jan.kronquist@gmail.com wrote:

> How much faster will the results be with with_positions_offsets? From 260
> seconds to... a few seconds? Or even faster?
Depends, but fast vector highlighting is around 2.5 times faster than
normal highlighting.

> Any other design suggestions? For example, since I'm only highlighting a
> single field, I'm considering saving a truncated version of that field
> that only contains the first 100 kB (or something). That would work in
> most cases, and I'd have to live with possibly missing highlighting for
> the really large documents.
Not really, unless the text to highlight is reduced, which would obviously
lower the time spent on highlighting by a lot, but that might not be
acceptable for all use cases.

--
Kind regards,

Martijn van Groningen

On 7/18/2012 1:53 PM, Martijn van Groningen wrote:

> Depends, but fast vector highlighting is around 2.5 times faster than
> normal highlighting.

I just finished a small experiment related to highlighting and use of
FastVectorHighlighter.

1 Speed

I found that with human-generated documents (as opposed to product catalog
entries, log entries, etc.) the speedup was a lot more than the quoted "2.5
times" (from the original release notes), particularly for large files.
Like Jan, I had the occasional very large file. Any query that returned
two such ridiculously large files and a few merely very large files used to
time out before returning; with offsets the same queries take a total of
about 2 seconds! Queries are for a page of 10 results, each result with one
(best) highlight fragment.
The things that highlighted fast before were many times faster (like
5-10x) while the slow ones were still 2+ times faster!
I can't find any queries (1 page of 10 results) that take more than 400
ms to do everything (including a bunch of overhead in the UI).
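
(For reference, the test queries were shaped roughly like the one below;
the index and field names are made up. One page of 10 hits, a single best
fragment per hit.)

curl -XPOST 'http://localhost:9200/docs/_search?pretty=true' -d '{
  "size": 10,
  "query": { "query_string": { "query": "some search terms" } },
  "highlight": {
    "fields": { "content": { "number_of_fragments": 1 } }
  }
}'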

2 Index size

Our indexes include the Tika-parsed text. After adding offsets, two example
test indexes changed size as follows:
Before 27.4 MB -> after 34.1 MB (766 documents) = 24% increase
Before 941 MB -> after 1412 MB (9,705 documents) = 50% increase (many files
occur twice in this corpus, so I believe the index overhead was inherently
smaller and adding offsets was therefore a larger percentage).

Another index of 10,742 files (with fewer repeats than the 9,705-file index
above), including offsets and the parsed text, resulted in an on-disk size
of 537 MB, which was about 40% of the size of all files (1.3 GB) and 198%
of the total of all Tika-parsed text (271 MB).
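
(If anyone wants to compare, the on-disk sizes can be read from the index
status API, e.g. with the command below, or simply with du -sh on the
node's data directory; "myindex" is a placeholder.)

curl -XGET 'http://localhost:9200/myindex/_status?pretty=true'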

Note, I don't consider these large examples.

Conclusions

An index with parsed text including term vectors with positions and
offsets is roughly half text and half indexing information. Since half our
index is the parsed text, the relative increase in size from adding offsets
would be roughly twice as large (x2) for indexes that do not store the
parsed text.

Adding offsets to an index with parsed text can add 25-50% to the size of
the index and can REALLY HELP when highlighting large files, often with a
_MINIMUM_ of approximately a 2-2.5x improvement. I think the folks at
Lucene were being very conservative in their estimates of the speed
increase.

YMMV,

-Paul

Thanks for this feedback.
Very interesting thread.

David


Thanks for all details!

Since I'm seeing results >1000 times slower with highlighting on large
objects, a speedup of 2.5 times will not really help. I will have to
investigate what limitations my customer can accept with regard to
highlighting.

/Jan
