Highlighting produces overlapping HTML tags

We are dealing with HTML content stored in Elasticsearch, and our task is to search it and highlight the matched text.

The issue occurs with text such as <span>Hello</span>World. Searching for the word HelloWorld returns a highlight like this:
<span><hi>Hello</span>World</hi>
I realize this situation might be difficult to resolve.

Please see my mapping config below:

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30,
          "token_chars": ["letter", "digit"]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "english,british",
            "usa,united states of america,us"
          ]
        }
      },
      "char_filter": {
        "my_html_filter": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "char_filter": ["my_html_filter"],
          "filter": [
            "lowercase",
            "asciifolding",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Indexing a document of type doc:

{
    "text": "<span>Hello</span>World"
}

Search:

{
    "query": {
        "multi_match": {
            "fields": ["text"],
            "query": "helloworld"
        }
    },
    "highlight" : {
        "type": "unified",
        "require_field_match": false,
        "number_of_fragments": 0,
        "fields" : {
            "*" : { "pre_tags" : ["<hi>"], "post_tags" : ["</hi>"] }
        }
    }
}

Result:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.29032138,
        "hits": [
            {
                "_index": "test",
                "_type": "doc",
                "_id": "1",
                "_score": 0.29032138,
                "_source": {
                    "text": "<span>Hello</span>World"
                },
                "highlight": {
                    "text": [
                        "<span><hi>Hello</span>World</hi>"
                    ]
                }
            }
        ]
    }
}

As you can see, the HTML tags overlap in the result.
What is the best way to resolve this so that the tags do not overlap and the result looks like this:
<span><hi>Hello</hi></span><hi>World</hi>

Thanks!

The html_strip char filter only prevents things like the span tag from appearing in the search index. It does not change the original string stored in the document's _source, which means highlighters are still presented with both the text and the markup.
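
To illustrate why the tags end up crossing, here is a small Python sketch of what wrapping by raw character offsets does to the stored string. It assumes the matched token spans offsets 6 to 23 of the original string; those numbers are inferred from the response above rather than read from Elasticsearch, and wrap_by_offsets is a hypothetical helper, not part of any Elasticsearch API.

```python
def wrap_by_offsets(source, start, end, pre="<hi>", post="</hi>"):
    """Insert pre/post tags at raw character offsets, ignoring any markup in between."""
    return source[:start] + pre + source[start:end] + post + source[end:]


stored = "<span>Hello</span>World"

# Assumed offsets: "Hello" starts at index 6 and "World" ends at index 23
# in the stored string, so the closing </span> falls inside the wrapped range.
highlighted = wrap_by_offsets(stored, 6, 23)
print(highlighted)  # <span><hi>Hello</span>World</hi>
```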

Don't pass HTML to Elasticsearch. You'll need to cleanse the text upstream - either in your application code or perhaps as part of an ingest pipeline.
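
As a rough idea of what cleansing in application code could look like, here is a minimal Python sketch that strips tags with the standard library before the document is sent for indexing. TextExtractor and strip_html are hypothetical names, and real-world HTML (scripts, styles, block-level spacing) would likely need more care than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


# Build the document body from the cleansed text instead of the raw HTML.
doc = {"text": strip_html("<span>Hello</span>World")}
print(doc)  # {'text': 'HelloWorld'}
```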

Thanks for the comment!

But our task is to display the document as is (with its HTML markup) with the matches highlighted. These documents are later displayed in the UI with all formatting preserved.
So removing the HTML markup is not an option for us.
Resolving the markup on our side is one solution; I am just trying to find a way to do this on the Elasticsearch side.

It's too hard for the following reasons:

  1. "snippeting" can break markup - highlighting is also about summarising a doc to the best matching sections only. This task is further complicated if the HTML begin/end tags in the original doc don't tie in neatly with the snippeting logic's idea of an interesting section. Tags may need inserting just to open or close the dangling sections deemed uninteresting.
  2. highlighted sections can conflict - a phrase query might match a number of tokens and it's not easy to mark them as a single <em>...</em> block if there are HTML tags that either begin or end inside that span.
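
For the simple whole-field case (number_of_fragments set to 0, as in the search request above), the second point can be worked around on the application side by giving up on a single highlight block and wrapping each text run separately. The Python sketch below is one possible approach, not something Elasticsearch provides: rebalance is a hypothetical helper, it assumes the <hi> and </hi> markers never occur in the document itself, and it assumes tags contain no ">" inside attribute values. It does not address the snippeting problem from the first point.

```python
import re

# Splits a string into tags and text runs, keeping the tags as separate pieces.
TAG = re.compile(r"(<[^>]+>)")


def rebalance(fragment, pre="<hi>", post="</hi>"):
    """Re-emit highlight markers so they never cross an HTML tag boundary."""
    out = []
    inside = False  # True while we are logically inside a highlighted region
    for piece in TAG.split(fragment):
        if not piece:
            continue
        if piece == pre:
            inside = True
        elif piece == post:
            inside = False
        elif TAG.fullmatch(piece):
            out.append(piece)  # ordinary HTML tag, always emitted outside <hi>
        elif inside:
            out.append(pre + piece + post)  # wrap each text run on its own
        else:
            out.append(piece)
    return "".join(out)


fixed = rebalance("<span><hi>Hello</span>World</hi>")
print(fixed)  # <span><hi>Hello</hi></span><hi>World</hi>
```

The trade-off is that one logical match becomes several adjacent highlight elements, which is usually acceptable for display but matters if the UI counts matches by counting tags.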

OK, I see. Thanks for the response!
