Why isn't Lucene's BreakIteratorBoundaryScanner supported?

Hi, I noticed that when requesting to perform highlighting of search results (with FVH), one can only specify the boundary characters. I also reviewed the FastVectorHighlighter code (in Elastic) and saw it only creates the SimpleBoundaryScanner.

Is there a reason why BreakIteratorBoundaryScanner is not supported? I would like to break my snippets on sentences, and SimpleBoundaryScanner does not do a good job at it.

Lucene's FastVectorHighlighter supports this scanner, as does Solr, so I was curious to know why it's not supported in ES. If there is no special reason, would it be OK if I created a PR to add support for it?
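
To illustrate the difference I mean, here is a minimal sketch using the JDK's `java.text.BreakIterator` (the API that Lucene's BreakIteratorBoundaryScanner wraps). The class and method names are hypothetical, and `splitOnChars` is only a naive stand-in for cutting on a list of boundary characters; it is not how SimpleBoundaryScanner is actually implemented.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
final class SentenceSplitSketch {

    /** Splits text into trimmed sentences using the JDK sentence BreakIterator. */
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    /** Naive contrast: cut at every '.', '!' or '?' regardless of context. */
    static List<String> splitOnChars(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split("[.!?]")) {
            String s = part.trim();
            if (!s.isEmpty()) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Version 1.5 is out. Try it!";
        System.out.println(sentences(text));    // sentence-aware boundaries
        System.out.println(splitOnChars(text)); // also cuts inside "1.5"
    }
}
```

The character-based split cuts inside "1.5", while the sentence iterator keeps the version number intact, which is the kind of behavior I am after for snippets.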

No specific reason, so you can open a PR for it! That said, the new unified highlighter, which handles fields with term vectors, is based on the BreakIterator API, so you could also try switching to this new highlighter instead.

Thanks @jimczi. The unified highlighter seems to only return single sentences? I.e. I cannot ask it to return a snippet of a specific length. FVH, by contrast, when asked for a snippet of length 200 with the SENTENCE break iterator, returns a snippet of approximately that length that may cover multiple sentences.

I will look into adding break iterator support to FVH and post back here as I make progress.

Yes, that's the current state of the unified highlighter, but we're planning to add support for target-length snippets. This is already supported in Lucene; I just need to find some time to work on it.
However, using a sentence break iterator would break the target-length expectation of the FVH highlighter. A sentence break iterator is not bounded the way SimpleBoundaryScanner is, which by default checks a maximum of 20 characters to the left and right to find a boundary.
I think that's one explanation why it's not currently supported in ES :wink:
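
To make the bounded versus unbounded point concrete, here is a rough sketch in plain Java. It is a simplified approximation written for this thread, not Lucene's actual SimpleBoundaryScanner code: the class name, method names, and the boundary character set are assumptions for illustration; only the 20-character scan limit comes from the description above.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch; not the Lucene implementation.
final class BoundaryScanSketch {

    // Assumed boundary characters, chosen for illustration.
    static final String BOUNDARY_CHARS = ".,!? \t\n";

    /**
     * Bounded scan: look at most maxScan characters to the right of start
     * for a boundary character. If none is found, give up and return start,
     * so the end offset can never drift more than maxScan characters.
     */
    static int boundedEnd(String text, int start, int maxScan) {
        if (start < 0 || start > text.length()) {
            return start;
        }
        int limit = Math.min(text.length(), start + maxScan);
        for (int i = start; i < limit; i++) {
            if (BOUNDARY_CHARS.indexOf(text.charAt(i)) >= 0) {
                return i;
            }
        }
        return start;
    }

    /**
     * Unbounded scan: jump to the next sentence boundary after start,
     * however far away it is.
     */
    static int sentenceEnd(String text, int start) {
        if (start < 0 || start >= text.length()) {
            return text.length();
        }
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int end = it.following(start);
        return end == BreakIterator.DONE ? text.length() : end;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog while nobody is watching. Next.";
        System.out.println(boundedEnd(text, 4, 20)); // stops at the next nearby boundary character
        System.out.println(sentenceEnd(text, 4));    // runs to the start of the next sentence
    }
}
```

The bounded scan moves the offset by at most `maxScan` characters, while the sentence-based scan can move it by the full length of a sentence, which is where the target length stops being predictable.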

That sounds reasonable to me. If you ask for length 200 and a SENTENCE break iterator, I think it's OK to return a snippet of 200+ characters that ends at the end of a sentence. That's how it works in Lucene and also in Solr. It's the user's choice anyway (and we can keep 'simple' as the default), so I believe users who choose it won't expect a snippet of exactly 200 characters.
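
As a rough sketch of the behavior I have in mind (plain JDK code, not the FVH or Solr implementation; the class and method names are hypothetical): take the target length as a minimum, then extend the snippet to the next sentence boundary.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of "target length, then finish the sentence".
final class SnippetSketch {

    /**
     * Returns a snippet starting at start that is at least targetLength
     * characters long (when the text allows it) and ends on a sentence
     * boundary. Assumes 0 <= start <= text.length().
     */
    static String snippet(String text, int start, int targetLength) {
        int rawEnd = Math.min(text.length(), start + targetLength);
        if (rawEnd >= text.length()) {
            return text.substring(start).trim();
        }
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        // Keep rawEnd if it already falls on a boundary, otherwise extend.
        int end = it.isBoundary(rawEnd) ? rawEnd : it.following(rawEnd);
        if (end == BreakIterator.DONE) {
            end = text.length();
        }
        return text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String text = "One two three. Four five six. Seven eight nine.";
        // Target of 20 characters lands mid-sentence, so the snippet
        // is extended to the end of the second sentence.
        System.out.println(snippet(text, 0, 20));
    }
}
```

The result is longer than the requested length, but it spans whole sentences, which is the trade-off a user opts into by choosing the sentence iterator.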

That's true but my point is that the sentence break iterator is very sensitive so it can produce very long sentences. I think it's fine if your input is composed of perfectly structured text with valid sentence boundaries.

Seems like https://issues.apache.org/jira/browse/LUCENE-7620 is exactly what you're looking for then.
I think it's better to enrich the unified highlighter. It is still experimental but very active in Lucene, so maybe this is a good opportunity to switch?

Thanks @jimczi. I created a PR https://github.com/elastic/elasticsearch/pull/23248 anyway to add support for "break_iterator" to FVH. Since this is already supported in Lucene, it was quite a straightforward implementation in Elastic. Note that I did not yet add support for "WORD", "CHARACTER" and "LINE" - wanted to get initial feedback before I do that. If the PR looks good on your end, supporting all types means adding another property, which is trivial to do.

As for the unified highlighter, I understand what you're saying, but I still think FVH is useful (at least until the unified highlighter is no longer experimental), so this addition will give users more control and flexibility. What do you think?
