HTML_strip / highlight combo limitations?

Lukas_Vlcek1 · June 18, 2013, 2:03pm

Hi,

To me it seems that the HTML_strip char filter and highlighting do not always
play perfectly together. For example, it can produce invalid HTML output
around "token", where a tag that is part of the original content and a tag
added by the highlighter end up incorrectly nested.

A more elaborate, full recreation script can be found here:

https://gist.github.com/lukas-vlcek/5805393

gistfile1.sh

curl -X DELETE 'localhost:9200/i/'

curl -X POST 'localhost:9200/i/' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0,
    "analysis" : {
      "analyzer" : {
        "content" : {
          "type" : "custom",

This file has been truncated; the full script is in the gist linked above.

I just wanted to check with the mailing list in case I am doing something
wrong on my side. I did not check the low-level details, but as I understand
it, one currently cannot expect the highlighter output to be valid HTML in
such a case.

In other words, HTML_strip is useful during analysis to remove HTML tags and
translate HTML entities. But if highlighting is needed, the client has to
provide the stripped content in another field.
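
A minimal sketch of that workaround, assuming a naive client-side stripper is good enough for the content at hand. The `strip_tags` and `make_doc` helpers and the `content_plain` field name are made up for illustration. The `sed` expression only removes tags: it does not translate HTML entities the way HTML_strip does, and the sketch assumes the text needs no JSON escaping.

```shell
# Naive tag stripper (assumption: no ">" inside attribute values, no entities).
strip_tags() {
  printf '%s\n' "$1" | sed -e 's/<[^>]*>//g'
}

# Build a document body with the raw HTML in one field and the
# client-stripped text in a second, hypothetical field "content_plain".
make_doc() {
  printf '{"content": "%s", "content_plain": "%s"}\n' "$1" "$(strip_tags "$1")"
}

# Print the body for a sample value.
make_doc 'a <b>token</b> here'
```

The resulting body could then be indexed and highlighting requested on `content_plain` rather than on the HTML field, so that the highlighter's tags are the only markup in the returned fragment.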

Regards,
Lukas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jpountz · June 19, 2013, 9:35am

You are correct, HtmlStripCharFilter would compute offsets so that the
opening tag is before the start offset and the closing tag is before the
end offset, so this doesn't play nicely with highlighting. This is due to
the way CharFilters correct offsets. I couldn't find any issue related to
this problem, so maybe you could open one in Lucene's JIRA?

--
Adrien Grand
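
To illustrate the offsets described above, here is a small sketch, not Lucene's actual code. The sample text, the offsets, and the `wrap_offsets` helper are assumptions chosen to mimic a start offset placed after the opening tag and an end offset placed after the closing tag, with `<em>` standing in for the highlighter's tag.

```shell
# Wrap the span [start, end) of a string in <em>...</em>.
# $1 = text, $2 = start offset (0-based, inclusive), $3 = end offset (exclusive)
wrap_offsets() {
  awk -v s="$1" -v a="$2" -v b="$3" 'BEGIN {
    printf "%s<em>%s</em>%s\n", substr(s, 1, a), substr(s, a + 1, b - a), substr(s, b + 1)
  }'
}

# Hypothetical stored value; "token" is the only matching term.
src='a <b>token</b> here'

# Assumed corrected offsets: 5 is just after "<b>", 14 is just after "</b>".
wrap_offsets "$src" 5 14
# prints: a <b><em>token</b></em> here
```

The `<em>` opens inside `<b>` but closes after `</b>`, which is the kind of mis-nesting that makes the fragment invalid HTML.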


Lukas_Vlcek1 · June 19, 2013, 11:12am

Yeah, maybe it is worth opening an issue in Lucene's JIRA.

Lukas

