HTML_strip / highlight combo limitations?

Hi,

It seems to me that the HTML_strip char filter and highlighting do not always
play well together. For example, the combination can produce invalid HTML output,
like: "token" where is part of original content and
is a tag added by highlighter.

A more elaborate, full recreation script can be found here:

I just wanted to check with the mailing list in case I am doing something
wrong on my side. I did not check the low-level details, but my understanding
is that currently the highlighter output cannot be expected to be valid HTML
in such a case.

In other words, HTML_strip is useful during analysis to remove HTML tags and
translate HTML entities. But if highlighting is needed, the client has to
provide the already stripped content in another field.
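
To illustrate the "stripped content in another field" idea, here is a minimal
client-side sketch in Python. It is only one possible approach, using the
standard library's html.parser. The field names content_html and content_text
and the helper names are made up for this example and are not anything
Elasticsearch requires.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True makes the parser translate entities
        # such as &amp; before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the markup with tags removed and entities translated."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


def build_document(markup):
    """Build a document with the raw HTML and a stripped copy.

    Field names are hypothetical: the stripped field would be the one
    to search and highlight on, the raw field is kept for display.
    """
    return {"content_html": markup, "content_text": strip_html(markup)}
```

With this, highlighting would run against content_text only, so the
highlighter tags can never interleave with tags from the original content.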

Regards,
Lukas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

You are correct, HtmlStripCharFilter would compute offsets so that the
opening tag is before the start offset and the closing tag is before the
end offset, so this doesn't play nicely with highlighting. This is due to
the way CharFilters correct offsets. I couldn't find any issue related to
this problem, so maybe you could open one in Lucene's JIRA?
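
The offset behaviour described above can be reproduced with a small toy model
in Python. This is a simplified sketch, not Lucene's actual code: the regex
based tag stripping and the names strip_tags, correct_offset and highlight
are assumptions made for illustration. It only mimics the idea that a
stripped-text offset is mapped back to the original by adding the length of
everything removed up to that offset.

```python
import re
from bisect import bisect_right

TAG = re.compile(r"<[^>]+>")  # naive tag matcher, good enough for the demo


def strip_tags(original):
    """Remove tags and record offset corrections.

    Returns the stripped text and a list of pairs
    (offset_in_stripped_text, total_chars_removed_so_far).
    """
    pieces, corrections = [], []
    last_end = out_len = removed = 0
    for m in TAG.finditer(original):
        pieces.append(original[last_end:m.start()])
        out_len += m.start() - last_end
        removed += m.end() - m.start()
        corrections.append((out_len, removed))
        last_end = m.end()
    pieces.append(original[last_end:])
    return "".join(pieces), corrections


def correct_offset(offset, corrections):
    """Map an offset in the stripped text back to the original text."""
    keys = [c[0] for c in corrections]
    i = bisect_right(keys, offset)  # last correction at or before offset
    return offset + (corrections[i - 1][1] if i else 0)


def highlight(original, term):
    """Wrap the first occurrence of term in <em>, using corrected offsets."""
    stripped, corrections = strip_tags(original)
    start = stripped.find(term)
    if start < 0:
        return original
    s = correct_offset(start, corrections)
    e = correct_offset(start + len(term), corrections)
    return original[:s] + "<em>" + original[s:e] + "</em>" + original[e:]


print(highlight("<b>token</b> rest", "token"))
# prints: <b><em>token</b></em> rest
```

In this model the start offset lands after the opening tag and the end offset
lands after the closing tag, so the highlighted span swallows the closing tag
and the resulting markup is not properly nested.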

--
Adrien Grand


Yeah, maybe it is worth opening an issue in the Lucene JIRA.

Lukas

On Wed, Jun 19, 2013 at 11:35 AM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

You are correct, HtmlStripCharFilter would compute offsets so that the
opening tag is before the start offset and the closing tag is before the
end offset, so this doesn't play nicely with highlighting. This is due to
the way CharFilters correct offsets. I couldn't find any issue related to
this problem, so maybe you could open one in Lucene's JIRA?

--
Adrien Grand
