Postings highlighter returns too many sentences

Sheker · March 29, 2016, 10:26pm

Hi all,

I have a problem with the postings highlighter. According to the docs:
"...the postings highlighter... outputs sentences regardless of their length."

So, by setting:
"number_of_fragments" : 1
I should only get one sentence back. This is what happens 90% of the time, but sometimes I get a really long text which is obviously more than one sentence. For example (the highlighted words are "river" and "polluted"):

It is a collegiate body with an advisory and deliberative of the Integrated Water Resources Management - working on Unit Water Resources Management 10, built by the state, municipalities and civil society, equally. [ 2 ] This committee took the initiative of civil society and currently includes 34 municipalities, 18 were located in Sorocaba River basin and 16 situated in the sub-basin of the upper Middle Tietê. [ 3 ] It has been a very polluted river due to industrial activities, mining, sewage without treatment, etc.

There are 3 sentences and the first two don't even have the highlighted words in them.
I think there is a bug here that makes the postings highlighter ignore a '.' when it is followed by a '['. I've noticed this to be the case in all the bad highlighting results.

Is this a known bug, or am I missing something?
Thanks

Sheker · March 30, 2016, 11:30am

Update:
Some testing shows this happens whenever '.' is not followed by a capital letter.

javanna · April 1, 2016, 12:33pm

The Lucene postings highlighter uses a Java BreakIterator to split the text into sentences before highlighting them. What you get back is seen as one single sentence by the Java sentence break iterator. The reason does seem to be that it expects the punctuation mark '.' to be followed by a space and then a capital letter, which unfortunately is not the case in your text. I am not sure how to fix this. It is not something we can fix in Elasticsearch, as we simply expose this functionality, and the problem is not in Lucene either, as the break iterator is part of the JDK. What you can do is plug in a different break iterator. There may be other ones available, probably open source, that know better how to deal with this specific problem, but I am not sure.
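
To see how the JDK splits a given text, the sentence break iterator can be run on its own, outside Elasticsearch. This is only a minimal sketch: the class name SentenceSplitDemo and the helper sentences() are made up for illustration, and using Locale.ROOT is an assumption about how the highlighter is configured. The sample text is an excerpt from the first post.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of Lucene or Elasticsearch.
final class SentenceSplitDemo {

    // Splits text with the JDK sentence BreakIterator and returns the pieces in order.
    // Locale.ROOT is an assumption; pass a different locale to compare results.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This committee took the initiative of civil society and currently "
                + "includes 34 municipalities, 18 were located in Sorocaba River basin and "
                + "16 situated in the sub-basin of the upper Middle Tietê. [ 3 ] It has been "
                + "a very polluted river due to industrial activities, mining, sewage "
                + "without treatment, etc.";
        // Print each piece the iterator considers one sentence.
        for (String s : sentences(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Running this on a problematic document shows where the iterator does and does not see a boundary, which also makes it easy to compare against any replacement break iterator.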

Sheker · April 1, 2016, 10:48pm

Thanks, I understand.
For now, my solution is to post-process the highlighted result and cut it 60 chars before the first highlighted word and 60 chars after the last.
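
Roughly, the post-processing could look like the sketch below. The class and method names are made up for illustration, and it assumes the highlights are wrapped in the default <em> and </em> tags; adjust the constants if you set custom pre_tags and post_tags. Note that it cuts at a fixed character offset, so it can cut in the middle of a word.

```java
// Hypothetical helper, not part of Elasticsearch or Lucene.
final class HighlightTrimmer {

    // Assumed highlight tags (the Elasticsearch defaults).
    private static final String PRE_TAG = "<em>";
    private static final String POST_TAG = "</em>";

    // Keeps at most `context` chars before the first highlight
    // and at most `context` chars after the last one.
    static String trimAroundHighlights(String fragment, int context) {
        int first = fragment.indexOf(PRE_TAG);
        int lastClose = fragment.lastIndexOf(POST_TAG);
        if (first < 0 || lastClose < first) {
            return fragment; // nothing highlighted (or malformed tags): leave untouched
        }
        int start = Math.max(0, first - context);
        int end = Math.min(fragment.length(), lastClose + POST_TAG.length() + context);
        return fragment.substring(start, end);
    }

    public static void main(String[] args) {
        String fragment = "This committee took the initiative of civil society and currently "
                + "includes 34 municipalities. [ 3 ] It has been a very <em>polluted</em> "
                + "<em>river</em> due to industrial activities, mining, sewage without "
                + "treatment, etc.";
        System.out.println(trimAroundHighlights(fragment, 60));
    }
}
```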