Why isn't Lucene's BreakIteratorBoundaryScanner supported?

Shai_Erera · February 17, 2017, 9:23am

Hi, I noticed that when requesting to perform highlighting of search results (with FVH), one can only specify the boundary characters. I also reviewed the FastVectorHighlighter code (in Elastic) and saw it only creates the SimpleBoundaryScanner.

Is there a reason why BreakIteratorBoundaryScanner is not supported? I would like to break my snippets on sentences, and SimpleBoundaryScanner does not do a good job at it.

Lucene's FastVectorHighlighter supports this scanner, and so does Solr, so I was curious to know why it's not supported in ES. If there is no special reason, would it be OK if I created a PR to add support for it?

jimczi · February 17, 2017, 12:59pm

Hi,
No specific reason, so you can open a PR for it! Note, though, that the new unified highlighter, which handles fields with term vectors, is based on the BreakIterator API, so you could also try switching to it instead:
https://www.elastic.co/guide/en/elasticsearch/reference/5.x/search-request-highlighting.html#_unified_highlighter

Shai_Erera · February 17, 2017, 4:09pm

Thanks @jimczi. The unified highlighter seems to only return sentences? I.e. I cannot ask it to return a snippet of a specific length. FVH, by contrast, when asked for a snippet of length 200 with a SENTENCE breakIterator, returns a snippet of approximately that length that may cover multiple sentences.

I will look into adding break iterator support to FVH and post back here as I make progress.

jimczi · February 17, 2017, 4:25pm

That's the current state of the unified highlighter, yes, but we're planning to add support for target-length snippets. This is already supported in Lucene:
https://issues.apache.org/jira/browse/LUCENE-7620
I just need to find some time to work on it.
However, using a sentence break iterator would break the target length expectation of the FVH. A sentence break iterator is not bounded like the SimpleBoundaryScanner, which by default checks a maximum of 20 characters to the left and right to find a boundary.
I think that's one explanation for why it's not currently supported in ES.
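
To make the bounded versus unbounded distinction concrete, here is a minimal JDK-only sketch. It is not Lucene's code: `boundedEnd` is a hypothetical helper that only imitates the idea of scanning a limited number of characters for a boundary character, and `sentenceEnd` simply asks `java.text.BreakIterator` for the next sentence boundary. The class name, method names and the sample text are all assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BoundarySketch {

    // Bounded scan (hypothetical helper): look at no more than maxScan
    // characters to the right of start for one of boundaryChars.
    // If none is found within that window, give up and return start unchanged.
    static int boundedEnd(String text, int start, int maxScan, String boundaryChars) {
        if (start < 0 || start > text.length()) {
            return start;
        }
        int offset = start;
        for (int count = maxScan; offset < text.length() && count > 0; count--, offset++) {
            if (boundaryChars.indexOf(text.charAt(offset)) >= 0) {
                return offset;
            }
        }
        return start;
    }

    // Unbounded scan: the first sentence boundary after start, however far away.
    static int sentenceEnd(String text, int start) {
        if (start < 0 || start >= text.length()) {
            return start;
        }
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);
        int end = bi.following(start);
        return end == BreakIterator.DONE ? text.length() : end;
    }

    public static void main(String[] args) {
        String text = "one two three four five six seven eight nine ten eleven twelve. Next sentence.";
        // The bounded scan never moves more than 20 characters from the start offset.
        System.out.println(boundedEnd(text, 0, 20, ".!?"));
        // The sentence iterator moves as far as the sentence requires.
        System.out.println(sentenceEnd(text, 0));
    }
}
```

With a window of 20 characters the bounded scan can move the fragment edge by at most 20 characters, so the fragment stays close to the requested size. The sentence iterator has no such limit: the edge moves as far as the next detected sentence boundary, which is the concern raised above.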

Shai_Erera · February 17, 2017, 4:44pm

That sounds reasonable to me. If you ask for length 200 and a SENTENCE breakIterator, I think it's OK to return a snippet of 200+ characters that ends at the end of a sentence. That's how it works in Lucene and also in Solr, and it's anyway a user's choice (and we can keep 'simple' the default), so users who choose that, I believe, won't expect a snippet of exactly 200 characters ...
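
As a rough illustration of that behavior (a sketch under stated assumptions, not how Lucene's FVH actually builds fragments), the following JDK-only example starts a snippet at the sentence containing the match and extends it to the first sentence boundary at or after the target length. The class `SentenceSnippetSketch`, the method `snippet` and the sample text are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class SentenceSnippetSketch {

    // Returns a snippet that begins at the start of the sentence containing
    // matchOffset and ends at the first sentence boundary located at or after
    // (snippet start + targetLength). The result may exceed targetLength.
    static String snippet(String text, int matchOffset, int targetLength) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(text);

        // Clamp the offset so the BreakIterator calls below stay in range.
        int pos = Math.max(0, Math.min(matchOffset, text.length()));
        int start = bi.isBoundary(pos) ? pos : bi.preceding(pos);

        // Walk forward to the sentence boundary that satisfies the target length.
        int wanted = Math.min(start + Math.max(0, targetLength), text.length());
        int end = bi.isBoundary(wanted) ? wanted : bi.following(wanted);
        if (end == BreakIterator.DONE) {
            end = text.length();
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Lucene ships several highlighters. "
                + "The fast vector highlighter needs term vectors. "
                + "Sentence boundaries keep snippets readable.";
        int match = text.indexOf("vector");
        System.out.println(snippet(text, match, 40)); // one whole sentence
        System.out.println(snippet(text, match, 60)); // spills into the next sentence
    }
}
```

The requested length acts as a minimum rather than a cap in this sketch: the snippet always ends on a sentence boundary, so it can be noticeably longer than requested when the last sentence is long.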

jimczi · February 17, 2017, 5:01pm

That's true, but my point is that the sentence break iterator is very sensitive to the quality of the input, so it can produce very long sentences. I think it's fine if your input is composed of perfectly structured text with valid sentence boundaries.

"I will look into adding break iterator support to FVH and post back here as I make progress."

Seems like LUCENE-7620 (UnifiedHighlighter: add target character width BreakIterator wrapper) is exactly what you're looking for then.
I think it's better to enrich the unified highlighter. It is still experimental but very active in Lucene, so maybe this is a good opportunity to switch?

Shai_Erera · February 19, 2017, 10:33am

Thanks @jimczi. I created a PR https://github.com/elastic/elasticsearch/pull/23248 anyway to add support for "break_iterator" to FVH. Since this is already supported in Lucene, it was quite a straightforward implementation in Elastic. Note that I did not yet add support for "WORD", "CHARACTER" and "LINE" - wanted to get initial feedback before I do that. If the PR looks good on your end, supporting all types means adding another property, which is trivial to do.

About the unified highlighter, I understand what you're saying, but I still think using FVH is useful (at least until Unified is out of experimental land), and therefore this addition will give users more control and flexibility. What do you think?

system · March 19, 2017, 10:34am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
