Highlight fragments of fields that use the html_strip char filter still contain HTML tags

Why do highlight fragments of HTML-stripped fields still contain HTML tags? As far as I understand the documentation, I should get the stripped but highlighted text.

Here's what I have.

HTML analyzer

GET /my-index/_settings shows that I have a standard_html analyzer with an html_strip char filter:

{
  "customer-portal": {
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "standard_html": {
              "filter": [
                "lowercase"
              ],
              "char_filter": [
                "html_strip"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }
  }
}

The analyzer does work as expected

GET /my-index/_analyze
{
  "analyzer": "standard_html",
  "text": "</a>Appointment types</h2>"
}

I get two tokens and no HTML tags

{
  "tokens": [
    {
      "token": "appointment",
      "start_offset": 4,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "types",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Indexed HTML field

I have an index that uses the above analyzer to index a content field

GET /my-index/_mapping

{
  "customer-portal": {
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "standard_html"
        }
      }
    }
  }
}

Single HTML document in index

POST /my-index/_doc
{
  "content": "<p>This is a <strong>superduper</strong> test.</p>"
}

Failed highlighting

As far as I understand, a highlighted fragment for the content field should NOT contain any HTML tags besides the ones defined through the pre_tags/post_tags options. However, that's not what I see.

GET /my-index/_search
{
  "query": {
    "match": {
      "content": "superduper"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

The <p> tags in _source.content are expected; the same tags in highlight.content are not.

{
  "hits": {
    "hits": [
      {
        "_source": {
          "content": "<p>This is a <strong>superduper</strong> test.</p>"
        },
        "highlight": {
          "content": [
            "<p>This is a <strong><em>superduper</em></strong> test.</p>"
          ]
        }
      }
    ]
  }
}

In my real-life index the effect of this behavior is much worse, because a highlight fragment can start and end anywhere in the markup and may therefore contain invalid HTML (opening and/or closing tags missing). Examples:

"""<div class="paragraph">
 <p>In order to connect with Microsoft <em>Graph</em> to read/write calendar entries,"""
"""="5"></i><b>5</b></td>
    <td>The maximum number of items to return for requests to the Microsoft <em>Graph</em>"""
"""class="fa icon-tip" title="Tip"></i></td>
     <td class="content">To analyze issues related to the <em>Graph</em>"""


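That's because highlighters work on the source text, which still contains the HTML content. The html_strip char filter only affects the tokens produced during analysis, not the text that gets highlighted.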
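
To illustrate the point, here is a minimal Python sketch of what happens when highlight tags are placed by character offsets into the original text. It is not how Elasticsearch's highlighters are implemented; `wrap_by_offsets` is a hypothetical helper, and `str.index` merely stands in for the start offset the analyzer would report. The offsets in the first two assertions are copied from the _analyze response in the question.

```python
def wrap_by_offsets(original, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) span of the ORIGINAL text in pre/post tags."""
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(original[cursor:start])          # untouched text, tags included
        out.append(pre + original[start:end] + post)
        cursor = end
    out.append(original[cursor:])
    return "".join(out)


# The token offsets from the _analyze response point into the unstripped input.
analyzed = "</a>Appointment types</h2>"
assert analyzed[4:15] == "Appointment"
assert analyzed[16:21] == "types"

doc = "<p>This is a <strong>superduper</strong> test.</p>"
start = doc.index("superduper")  # stand-in for the analyzer's start_offset
fragment = wrap_by_offsets(doc, [(start, start + len("superduper"))])
print(fragment)
```

Because the offsets refer to positions in the text as it was submitted, everything around the match, including the markup, is carried into the fragment unchanged.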

If you want to "alter" the source, I'd recommend using the HTML strip ingest processor (see "HTML strip processor" in the Elasticsearch Guide [8.14]), as this modifies the text at index time. Then highlighting will work the way you like.
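
As a rough sketch of that suggestion, the Python snippet below builds the request bodies for a pipeline with a single html_strip processor and prints them in Console form. The pipeline name `strip-html` is a hypothetical choice, and the snippet only prints the requests; it does not talk to a cluster. The processor is assumed to update `content` in place, which means the original markup no longer appears in _source.

```python
import json

PIPELINE_ID = "strip-html"  # hypothetical pipeline name

# Pipeline definition: strip HTML from the "content" field at index time.
pipeline_body = {
    "description": "Strip HTML from content before indexing",
    "processors": [{"html_strip": {"field": "content"}}],
}

# Body for the simulate API, to inspect the result before indexing anything.
simulate_body = {
    "pipeline": pipeline_body,
    "docs": [
        {"_source": {"content": "<p>This is a <strong>superduper</strong> test.</p>"}}
    ],
}

print(f"PUT /_ingest/pipeline/{PIPELINE_ID}")
print(json.dumps(pipeline_body, indent=2))
print()
print("POST /_ingest/pipeline/_simulate")
print(json.dumps(simulate_body, indent=2))
print()
# Index through the pipeline by naming it in the request.
print(f"POST /my-index/_doc?pipeline={PIPELINE_ID}")
```

If the original HTML also needs to be kept, the processor's `target_field` option could write the stripped text to a separate field, which would then be the one to search and highlight.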

Thanks! I went back to the docs and found the relevant section right at the top. No idea how I missed that:

Highlighting requires the actual content of a field. If the field is not stored (the mapping does not set store to true), the actual _source is loaded and the relevant field is extracted from _source.

In theory, does that mean that I could also store the content field to get the highlighter results I need? Since I use the html_strip char filter, the stored values won't contain any HTML tags, right?

Anyway, in the meantime I have already resorted to loading HTML-free content into ES. I get this for free from the DOM parser I already have in place (Jsoup; using .text() instead of .html()).
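
For readers without a DOM parser at hand, the same idea (send only the text to the index) can be sketched with Python's standard library. This is a rough analogue of the Jsoup approach, not an equivalent: `TextExtractor` and `html_to_text` are hypothetical names, text from adjacent block elements is joined without an extra space, and the contents of script or style elements are not filtered out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of html with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(html_to_text("<p>This is a <strong>superduper</strong> test.</p>"))
```

Indexing the returned string instead of the markup means both _source and the highlight fragments contain plain text only.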
