Comparing Plain, FVH and Postings Highlighter Performance


(Alex Roytman) #1

Hi, I did some performance comparisons of the new Postings highlighter
introduced in 0.9.6-SNAPSHOT against the two existing highlighters.
My case may not be very common, as I highlight many rather short fields:

out of 47 fields in total, about 15 have 10-20 words, 20 have 3-5 words and
12 are single words (codes).

I need to highlight them all. In fact, I need to show a list of the fields
with matches.

Here are the timings for highlighting 400 documents (query size=400):

Plain: 450ms
FVH: 1160ms
Post: 150ms
Mix: 250ms (mix uses plain highlighter on all single word fields and
Postings on the rest)

Indexing speed increased by 20% when using Postings vs FVH, and index size
decreased by about 40%.

So it looks like the Postings highlighter works very well on many short
fields and FVH is not good for that at all. It is probably a lot better on a
few long fields, but it seems to be a poor choice for many small ones.

What's your experience?

As a side note, I do not search on all 47 fields. I search on _all but
highlight individual fields, so when I said I need to show a list of matched
fields I meant it loosely: it is good enough just to show fields with
something to highlight rather than the ones that were actually matched.
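
For readers who want to try the "mix" setup described above, here is a minimal sketch of what such a search body might look like, built in Python. The field names are hypothetical (the post does not list the 47 fields), and the per-field `type` values (`plain`, `postings`) are the highlighter type names as I understand them for this Elasticsearch generation; treat the whole thing as an illustration rather than the exact request used for the timings.

```python
import json

# Hypothetical field names; the original post does not name its fields.
SINGLE_WORD_FIELDS = ["status_code", "country_code"]
TEXT_FIELDS = ["title", "summary"]


def build_search_body(query_text, size=400):
    """Build a search body that searches _all but highlights individual
    fields, choosing a highlighter type per field."""
    fields = {}
    # Single-word (code) fields use the plain highlighter.
    for name in SINGLE_WORD_FIELDS:
        fields[name] = {"type": "plain"}
    # Longer text fields use the postings highlighter.
    for name in TEXT_FIELDS:
        fields[name] = {"type": "postings"}
    return {
        "size": size,
        "query": {"match": {"_all": query_text}},
        "highlight": {"fields": fields},
    }


body = build_search_body("example")
print(json.dumps(body, indent=2))
```

Fields that come back with a non-empty highlight entry can then be listed as "fields with something to highlight".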

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

PostingsHighlighter is a bit different from the other highlighters in that
it tries to summarize hits based on the query terms while the other
highlighters try to highlight tokens that matched. If that works for you,
then it is probably a good option since it requires little additional index
size and is very fast as you noticed.

The differences in indexing speed and index size don't look surprising to me:
term vectors really take lots of space, which in turn makes indexing slower
because it requires more I/O.
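
To make the index-side difference concrete, here is a rough sketch of the two mapping variants, written as a Python dict. The type name `doc` and the field names are made up for illustration; `term_vector: with_positions_offsets` is the setting the fast vector highlighter relies on, and `index_options: offsets` is the one the postings highlighter relies on, assuming the string-field mapping syntax of that era.

```python
import json

# Hypothetical type and field names, for illustration only.
mapping = {
    "doc": {
        "properties": {
            # Fast vector highlighter: stores term vectors with positions
            # and offsets, which is what costs the extra index space.
            "summary_fvh": {
                "type": "string",
                "term_vector": "with_positions_offsets",
            },
            # Postings highlighter: only adds offsets to the postings list.
            "summary_postings": {
                "type": "string",
                "index_options": "offsets",
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```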

FastVectorHighlighter tries to be faster than the plain highlighter by not
re-analyzing content from the stored fields. Unfortunately, it is quite
easy to make the highlighting algorithm trigger CPU-intensive loops,
which is why the plain highlighter can sometimes be faster, especially if
the analysis chain is light.

--
Adrien Grand


(Luca Cavanna) #3

Hi Alex,
glad to hear you already experimented with the postings highlighter, and
with good results too!

What you might also have noticed is the big difference in output: it
returns nice sentences. The way it works internally is also different from
the existing highlighters: it first splits the text into sentences (using a
break iterator, which might not get along with markup), then highlights
those sentences and scores each one based on how well it represents the
document (using the BM25 algorithm).
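
As a toy illustration of the "split into sentences, then score each one" idea, here is a small Python sketch. It is not the Lucene implementation: the regex splitter stands in for a real break iterator, the BM25-style formula treats each sentence as its own tiny "document", and the `k1` and `b` values are common textbook defaults rather than whatever the highlighter actually uses.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens."""
    return re.findall(r"\w+", sentence.lower())


def score_sentences(text, query_terms, k1=1.2, b=0.75):
    """Return (score, sentence) pairs, best first, using a BM25-style score
    in which sentences play the role of documents."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    avg_len = sum(len(t) for t in tokenized) / n
    scored = []
    for sentence, tokens in zip(sentences, tokenized):
        score = 0.0
        for term in query_terms:
            term = term.lower()
            tf = tokens.count(term)
            if tf == 0:
                continue  # term absent: contributes nothing
            # Number of sentences containing the term.
            df = sum(1 for t in tokenized if term in t)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            # Length normalisation: longer sentences are penalised.
            norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scored.append((score, sentence))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


text = (
    "Elasticsearch ships several highlighters. "
    "The postings highlighter is fast. "
    "Nothing relevant here."
)
ranked = score_sentences(text, ["postings"])
for score, sentence in ranked:
    print(round(score, 3), sentence)
```

The top-ranked sentences would be the ones returned as snippets, with the query terms wrapped in highlight tags.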

Just saying that it's not all about performance here :wink:


(Alex Roytman) #4

Adrien, Luca,

Thank you for the extra details. I was thinking of pushing highlighting to the JavaScript client to avoid a heavy penalty on the ES side. But now I think I will stick with the postings highlighter :slight_smile:

Alex


(Alex Roytman) #5

Luca,

One thing I did not realize is that it does not highlight any wildcard (or,
I assume, prefix) matches, so it is not fully usable in those cases.

Thanks,
Alex


(Luca Cavanna) #6

Hey,
thanks for opening the issue about this (
https://github.com/elasticsearch/elasticsearch/issues/4042). I'm working on
it, and I think we will be able to highlight wildcard, prefix, fuzzy and
regexp queries too very soon :wink:

