[ANN] Elasticsearch experimental highlighter

I've been working on a new highlighter on and off for a few weeks and I'd
love for other folks to try it out: wikimedia/search-highlighter on GitHub.

You should try it because:

  1. It's pretty quick.
  2. It supports many of the features of the other highlighters and lets you
    combine them in new ways.
  3. It has a few tricks that none of the other highlighters have.
  4. It doesn't require that you store any extra data but will use what it
    can to speed itself up.

I've installed it on our beta site
http://simple.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=chess+players&fulltext=Search
so you can see it in action without installing it.

Let me expand on my list above:
It doesn't require any extra data, and for short fields it is nice and fast
that way. Once fields get longer [0], reanalyzing them starts to take too
long, so it is best to store offsets in the postings, just like the postings
highlighter does. It can also use term vectors the same way the fast vector
highlighter can, but that is slower than postings and takes up more space.
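
As a rough illustration of the two storage choices described above, here is a minimal Python sketch that builds the two mapping fragments. The field name "text" and the helper names are hypothetical; `index_options: offsets` and `term_vector: with_positions_offsets` are the standard Elasticsearch mapping settings used by the postings and fast vector highlighters, and the "string" type assumes a 1.x-era cluster.

```python
import json


def postings_offsets_mapping(field_name):
    """Mapping fragment that stores offsets in the postings list."""
    return {
        "properties": {
            field_name: {
                "type": "string",            # text field type in Elasticsearch 1.x
                "index_options": "offsets",  # same setting the postings highlighter needs
            }
        }
    }


def term_vector_mapping(field_name):
    """Mapping fragment that stores term vectors, as the fast vector highlighter needs."""
    return {
        "properties": {
            field_name: {
                "type": "string",
                "term_vector": "with_positions_offsets",
            }
        }
    }


if __name__ == "__main__":
    # "text" is a hypothetical field name used only for illustration.
    print(json.dumps(postings_offsets_mapping("text"), indent=2))
```

Leaving both settings out corresponds to the "no extra data" case, where the field is reanalyzed at highlight time.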

It supports three fragmenters: one that mimics the postings highlighter,
one that mimics the fast vector highlighter, and one that always highlights
the whole value.

It supports matched_fields, no_match_size, and most everything else in the
highlight API. It doesn't support require_field_match though.

It adds a handful of tricks like returning the top scoring snippets in
document order and weighting terms that appear early in the document
higher. Nothing difficult, but still cute tricks. It's reasonably easy to
implement new tricks, so if you have any ideas I'd love to hear them.
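
A minimal sketch of what a highlight request using these features might look like, written as a Python dict. The highlighter type name "experimental" and the nesting of plugin settings under "options" are assumptions not stated in this thread, so check the plugin's README before relying on them; `number_of_fragments` and `no_match_size` are ordinary highlight API parameters, and "body" is a hypothetical field name.

```python
import json


def experimental_highlight(field_name, fragmenter="scan", top_scoring=True):
    """Build the "highlight" part of a search body for one field."""
    return {
        "fields": {
            field_name: {
                "type": "experimental",    # assumed highlighter type name
                "fragmenter": fragmenter,  # "scan" or "sentence"
                "number_of_fragments": 3,
                "no_match_size": 100,
                "options": {               # assumed location of plugin-specific options
                    # top scoring snippets, returned in document order
                    "top_scoring": top_scoring,
                },
            }
        }
    }


if __name__ == "__main__":
    body = {
        "query": {"match": {"body": "chess players"}},
        "highlight": experimental_highlight("body"),
    }
    print(json.dumps(body, indent=2))
```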

I don't think it is really ready for production usage yet but I'd like to
get there in a week or two.

Thanks for reading,

Nik

[0]: I haven't done the measurements to figure out how long the field has
to be before it is faster to use postings than to reanalyze it. I did the
math a few months ago for how long the field has to be before vectors
become faster. It was a couple of KB for my analysis chain but I'm not
sure any of that holds true for this highlighter. It could be more or less.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd2ZpSdfcko5DtT6YNh1yjKG-NOek41ot%2BcPY1D84uDkHg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

I've just released version 0.0.3 of this plugin. It fixes:

  1. An error when returning a no match fragment with the sentence
    fragmenter if the no_match_size + max_scan is greater than the size of the
    document.
  2. Multi-valued fields using the analyze hit_source were pretty broken.
    The offsets would be wrong causing garbled highlights or errors.
  3. Fields inside objects would always return no hits for postings and
    vectors hit sources.

New tricks:

  1. The max_fragments_scored option can be used to limit the number of
    fragments scored when using score order or the top_scoring option. You can
    use it to prevent highlighting documents with many hits from eating a ton
    of CPU. This is more useful with the sentence fragmenter than the scan
    fragmenter. Still, if your documents are megabytes of text you might want
    to try it.
  2. The fetch_fields option can be used to return fields next to the
    highlighted field. It's a little jangly but it gets the job done if you are
    careful.

Nik

On Thu, Apr 10, 2014 at 4:04 PM, Nikolas Everett nik9000@gmail.com wrote:

I've been working on a new highlighter on and off for a few weeks and I'd
love for other folks to try it out:
GitHub - wikimedia/search-highlighter: Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for contributing


Hi Nikolas,

I'm likely to test this in the next couple of weeks (I'm still on 0.90.9),
but I have a question on performance. Does 'It's pretty quick' mean
performance comparable to the postings highlighter, the fast vector
highlighter, or just quick enough for your use case?

The reason why I'm asking is because highlighting performance is the
largest issue I face currently. Our documents have hundreds of very short
fields (well over a thousand if you count the sub fields in a multi-field
field) and listing every field/sub field to highlight causes queries to be
10-20x slower than highlighting just a single field (100ms -> 2100ms for
example). I can't use the _all field because I need to know the actual
field that was highlighted and only the fvh highlighter returns the high
quality results we need. I'm actually toying with the idea of doing a
two-phase search where the first phase only highlights a few fields that
commonly hit with a second phase that only searches the remaining hits that
didn't highlight on the first pass. That approach may work but I'd rather
just have a highlighter that was faster :)
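
The two-phase idea could be sketched roughly like this in Python. It only shows the bookkeeping between the phases: finding hits that came back without highlights and building a follow-up request restricted to those documents. The helper names and field names are hypothetical, and the `filtered` query with an `ids` filter is 0.90/1.x-era syntax.

```python
def ids_without_highlights(response):
    """Return the _id of every hit that came back with no highlight."""
    return [
        hit["_id"]
        for hit in response["hits"]["hits"]
        if not hit.get("highlight")
    ]


def second_phase_body(original_query, ids, remaining_fields):
    """Re-run the original query on just those ids, highlighting the other fields."""
    return {
        "query": {
            "filtered": {
                "query": original_query,  # kept so the highlighter sees the same terms
                "filter": {"ids": {"values": ids}},
            }
        },
        "size": len(ids),
        "highlight": {"fields": {name: {} for name in remaining_fields}},
    }


if __name__ == "__main__":
    first_phase = {
        "hits": {
            "hits": [
                {"_id": "1", "highlight": {"title": ["<em>chess</em> players"]}},
                {"_id": "2"},
            ]
        }
    }
    missing = ids_without_highlights(first_phase)
    print(second_phase_body({"match": {"title": "chess"}}, missing, ["notes"]))
```

Whether this beats a single request depends on how many hits fall through to the second phase, which the thread does not measure.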

All the best,

Bruce Ritchie


Hi Bruce,

I'm not actually sure it'll work on 0.90.X - I didn't start working on it
until 1.1.0.

"It's pretty quick" means lots of things, unfortunately. If you configure
it to segment the source like the postings highlighter, it is typically
about 10% slower than the postings highlighter. If you configure it to
segment more like the FVH (the default), it is generally faster than the
postings highlighter.

What features of the fvh do you need? I didn't implement them all; in
particular, I don't have require_field_match support. In recent releases
I've grown phrase support and I'll make another release sometime soon that
fixes some bugs there. It might be best to just try it and see if it works
for you.

Before I deployed this, highlighting was the largest time consumer on my
cluster, and afterwards it has pretty much vanished. The fvh can be very
slow at some things.

Just turning on the highlighter may not actually be more efficient because
you have term vectors on each of your fields. The highlighter will attempt
to use them but that might not be the best choice everywhere. For short
fields it's probably better to reanalyze them than to load the term vectors.
I'm not clear on exactly how many characters or words cause a field to be
"short" in this way, but I've seen it happen. Also, for the longer fields,
you are probably better off switching from term vectors
with_positions_offsets to storing the offsets in the postings. This means
configuring the field "as though" you were going to use the postings
highlighter. The term vectors might be faster in some cases, but I don't
know which. You can force reanalyzing the fields by setting the
"hit_source" to "analyze".

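One way to apply this advice per field is sketched below in Python. The cutoff of 2048 characters is purely hypothetical, since the thread gives no measured value; the hit_source values "analyze", "postings", and "vectors" are the ones named in this thread, while the "experimental" type name and the "options" nesting are assumptions.

```python
SHORT_FIELD_CHARS = 2048  # hypothetical cutoff; no measured value is given in the thread


def choose_hit_source(typical_length, has_postings_offsets):
    """Pick a hit_source for a field from its typical length and mapping."""
    if typical_length < SHORT_FIELD_CHARS:
        return "analyze"  # short field: reanalyze rather than load stored data
    # Longer field: prefer offsets in the postings, else fall back to term
    # vectors (assumes the field was indexed with term vectors).
    return "postings" if has_postings_offsets else "vectors"


def highlight_fields(field_stats):
    """field_stats maps field name -> (typical_length, has_postings_offsets)."""
    fields = {}
    for name, (length, has_offsets) in field_stats.items():
        fields[name] = {
            "type": "experimental",  # assumed highlighter type name
            "options": {"hit_source": choose_hit_source(length, has_offsets)},
        }
    return {"fields": fields}


if __name__ == "__main__":
    # Hypothetical fields: a short name field and a long description field.
    print(highlight_fields({"name": (40, False), "description": (20000, True)}))
```
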
Anyway, let me know how it goes,

Nik

On Thu, May 29, 2014 at 3:26 PM, Bruce Ritchie bruce.ritchie@gmail.com
wrote:
