Highlight fragments of fields that use the html_strip char filter still contain HTML tags

paulmuller · July 29, 2024, 2:08pm

Why do highlight fragments of HTML-stripped fields still contain HTML tags? Based on my reading of the documentation, I expected to get the stripped but highlighted text.

Here's what I have.

HTML analyzer

GET /my-index/_settings shows that I have a standard_html analyzer with an html_strip char filter:

{
  "customer-portal": {
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "standard_html": {
              "filter": [
                "lowercase"
              ],
              "char_filter": [
                "html_strip"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }
  }
}

The analyzer does work as expected

GET /my-index/_analyze
{
  "analyzer": "standard_html",
  "text": "</a>Appointment types</h2>"
}

I get two tokens and no HTML tags

{
  "tokens": [
    {
      "token": "appointment",
      "start_offset": 4,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "types",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Indexed HTML field

I have an index that uses the above analyzer to index a content field

GET /my-index/_mapping

{
  "customer-portal": {
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "standard_html"
        }
      }
    }
  }
}

Single HTML document in index

POST /my-index/_doc
{
  "content": "<p>This is a <strong>superduper</strong> test.</p>"
}

Failed highlighting

AFAIU a highlighted fragment for the content field should NOT contain any HTML tags besides the ones defined through the pre/post tags property. However, that's not what I see.

GET /my-index/_search
{
  "query": {
    "match": {
      "content": "superduper"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

The <p> tags in _source.content are expected; the same tags in highlight.content aren't.

{
  "hits": {
    "hits": [
      {
        "_source": {
          "content": "<p>This is a <strong>superduper</strong> test.</p>"
        },
        "highlight": {
          "content": [
            "<p>This is a <strong><em>superduper</em></strong> test.</p>"
          ]
        }
      }
    ]
  }
}

In my real-life index the effect of this behavior is much worse because the highlight fragment obviously might contain invalid HTML (open and/or close tags missing). Examples:

"""<div class="paragraph">
 <p>In order to connect with Microsoft <em>Graph</em> to read/write calendar entries,"""

"""="5"></i><b>5</b></td>
    <td>The maximum number of items to return for requests to the Microsoft <em>Graph</em>"""

"""class="fa icon-tip" title="Tip"></i></td>
     <td class="content">To analyze issues related to the <em>Graph</em>"""

dadoonet · July 30, 2024, 9:10am

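That's because the highlighter works on the original field text, which is loaded from _source and still contains the HTML. The html_strip char filter only affects the tokens that get indexed, not the text the highlighter reads.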
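
To make that concrete, here is a rough Python sketch of the mechanism. It is a simplified model, not Elasticsearch's actual highlighter code: token offsets are reported relative to the original text (tags included), and the pre/post tags are inserted at those offsets into the original field value. The function name `highlight` and the way the offsets are looked up are assumptions made for illustration.

```python
def highlight(source: str, start: int, end: int,
              pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    """Wrap source[start:end] in the highlight tags, leaving the rest untouched."""
    return source[:start] + pre_tag + source[start:end] + post_tag + source[end:]


# Field value exactly as it sits in _source, HTML included.
source = "<p>This is a <strong>superduper</strong> test.</p>"

# html_strip hides the tags from the tokenizer, but token offsets still point
# into the original text. Here the offsets are simply looked up by substring.
start = source.index("superduper")
end = start + len("superduper")

fragment = highlight(source, start, end)
print(fragment)
```

Since the tags are spliced into the unmodified source, everything around the match (including any markup) ends up in the fragment.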

If you want to "alter" the source, I'd recommend using the HTML strip processor (see "HTML strip processor" in the Elasticsearch Guide [8.14]), as this will modify the text at index time. Then highlighting will work the way you like.
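
A minimal sketch of that setup, written as Python dicts that mirror the JSON request bodies. The pipeline name `strip-html` is made up for illustration; `html_strip` with a `field` option is the ingest processor referred to above. Nothing here talks to a cluster, it only prints the bodies you would send.

```python
import json

# Body for: PUT _ingest/pipeline/strip-html   ("strip-html" is a hypothetical name)
pipeline_body = {
    "description": "Remove HTML tags from the content field before indexing",
    "processors": [
        {"html_strip": {"field": "content"}}
    ],
}

# Body for: POST /my-index/_doc?pipeline=strip-html
doc_body = {
    "content": "<p>This is a <strong>superduper</strong> test.</p>"
}

print(json.dumps(pipeline_body, indent=2))
print(json.dumps(doc_body, indent=2))
```

Because the processor rewrites the field before the document is stored, _source.content no longer contains tags, so the highlighter only has plain text to work with.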

paulmuller · July 30, 2024, 6:14pm

Thanks! I went back to the docs and found the relevant section right at the top. No idea how I missed that:

Highlighting requires the actual content of a field. If the field is not stored (the mapping does not set store to true), the actual _source is loaded and the relevant field is extracted from _source.

In theory, does that mean that I could also store the content field to get the highlighter results I need? Since I use the html_strip char filter, the stored values won't contain any HTML tags, right?

Anyway, in the meantime I've already resorted to loading HTML-free content into ES. I get this for free from the DOM parser I already have in place (Jsoup; using .text() instead of .html()).
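
For readers who are not on Java, a rough Python analogue of that client-side approach, using only the standard library. This is a sketch and not a drop-in replacement for Jsoup's .text(): the names `TextExtractor` and `html_to_text` are made up here, text inside script/style elements is kept, and adjacent block elements with no whitespace between them get glued together.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True (default) decodes entities
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Collapse whitespace runs left behind by indentation and line breaks.
    return " ".join("".join(parser.parts).split())


plain = html_to_text("<p>This is a <strong>superduper</strong> test.</p>")
print(plain)
```

The resulting plain string is what would go into the content field of the indexed document.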
