It seems to me that the HTML_strip char filter and highlighting do not always
play well together. For example, the combination can produce invalid HTML
output, where a tag that is part of the original content and a tag added by
the highlighter end up incorrectly nested around "token".
A more elaborate, full recreation script can be found here:
I just wanted to check with the mailing list in case I am doing something
wrong on my side. I did not check the low-level details, but my understanding
is that it is currently not realistic to expect the highlighter output to be
valid HTML in such a case.
In other words, HTML_strip is useful during analysis to remove HTML tags and
translate HTML entities. But if highlighting is needed, the client has to
provide the stripped content in another field.
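
A minimal sketch of that workaround, assuming the client prepares documents in
Python before indexing. The field names "body" and "body_plain" and the helper
names are hypothetical, and the stripping here uses the standard library's
html.parser, which only approximates what HTML_strip does on the server (for
example, it does not insert newlines for block-level tags).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and decodes entities such as &amp;."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of `markup` with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def build_document(raw_html):
    """Build a document with the raw HTML and a tag-free copy.

    "body" and "body_plain" are hypothetical field names; highlighting
    would be requested on the plain field only.
    """
    return {"body": raw_html, "body_plain": strip_html(raw_html)}
```

With this layout the highlighter only ever sees plain text, so the tags it
adds cannot interleave with tags from the original content.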
You are correct, HtmlStripCharFilter would compute offsets so that the
opening tag is before the start offset and the closing tag is before the
end offset, so this doesn't play nicely with highlighting. This is due to
the way CharFilters correct offsets. I couldn't find any issue related to
this problem, so maybe you could open one in Lucene's JIRA?
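
To illustrate the effect, here is a small sketch that does not call Lucene at
all. It hard-codes the offsets as described above (start offset after the
opening tag, end offset after the closing tag) and wraps that span the way a
highlighter would. The sample input, the <b> and <em> tags, and the highlight
helper are assumptions chosen for illustration.

```python
def highlight(original, start, end, pre="<em>", post="</em>"):
    """Wrap original[start:end] in highlighter tags."""
    return original[:start] + pre + original[start:end] + post + original[end:]


original = "<b>token</b>"

# Offsets as described above: the opening tag lies before the start
# offset, and the closing tag lies before the end offset.
start = original.index("token")   # just after "<b>"
end = len(original)               # just after "</b>"
reported = highlight(original, start, end)

# Offsets that would keep the markup well formed: the span covers
# only the token text itself.
tight_end = start + len("token")
well_formed = highlight(original, start, tight_end)

print(reported)      # <b><em>token</b></em>
print(well_formed)   # <b><em>token</em></b>
```

The first result closes <b> before <em>, which is the invalid nesting reported
in the question.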