Highlighting leads to html tags overlap

7alcon · August 17, 2018, 10:09am

We store HTML content in Elasticsearch, and our task is to search it and highlight the matched text.

The issue occurs with text like <span>Hello</span>World. Searching for HelloWorld produces a response like this:
<span><hi>Hello</span>World</hi>. I can see that this situation might be difficult to resolve.

Please see my mapping config below:

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30,
          "token_chars": ["letter", "digit"]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "english,british",
            "usa,united states of america,us"
          ]
        }
      },
      "char_filter": {
        "my_html_filter": {
          "type": "html_strip"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "char_filter": ["my_html_filter"],
          "filter": [
            "lowercase",
            "asciifolding",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}

Indexing the document:

{
	"text": "<span>Hello</span>World"
}

Search:

{
    "query": {
        "multi_match": {
            "fields": ["text"],
            "query": "helloworld"
        }
    },
    "highlight" : {
        "type": "unified",
        "require_field_match": false,
        "number_of_fragments": 0,
        "fields" : {
            "*" : { "pre_tags" : ["<hi>"], "post_tags" : ["</hi>"] }
        }
    }
}

Result:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.29032138,
        "hits": [
            {
                "_index": "test",
                "_type": "doc",
                "_id": "1",
                "_score": 0.29032138,
                "_source": {
                    "text": "<span>Hello</span>World"
                },
                "highlight": {
                    "text": [
                        "<span><hi>Hello</span>World</hi>"
                    ]
                }
            }
        ]
    }
}

As you can see, the HTML tags overlap in the result.
What is the best way to resolve this so that the tags do not overlap and the result looks like this?
<span><hi>Hello</hi></span><hi>World</hi>

Thanks!

Mark_Harwood · August 17, 2018, 10:27am

The use of html_strip is only to prevent things like the tag span from appearing in the search index. It does not change the original string stored in the source of the document which means highlighters will still be presented with the text and the markup.

Don't pass HTML to Elasticsearch. You'll need to cleanse the text upstream - either in your application code or perhaps as part of an ingest pipeline.
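
As an illustration of cleansing in application code, here is a minimal sketch that strips tags with Python's standard html.parser before indexing. The field names text_html and text are assumptions made for this example and are not taken from the mapping above; whether to keep the raw markup in a separate field at all is a design choice.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.parts.append(data)


def strip_html(html):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Hypothetical document shape: raw markup kept for display,
# plain text kept for searching and highlighting.
doc = {"text_html": "<span>Hello</span>World"}
doc["text"] = strip_html(doc["text_html"])
```

Highlighting the plain-text field then cannot produce overlapping tags, but the highlighted fragment no longer carries the original formatting.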

7alcon · August 17, 2018, 11:52am

Thanks for the comment!

But our task is to display the document as is (with HTML markup) with the matches highlighted. Later these documents will be shown in the UI with all formatting preserved.
So removing the HTML markup is not an option for us.
Resolving the markup on our side is a solution; I am just trying to find a way to do this on the Elasticsearch side.
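
For reference, resolving the markup on the application side could look roughly like the sketch below. It does not use the Elasticsearch highlighter output at all: it redoes the matching itself as a plain case-insensitive substring search over the text with tags removed, then wraps each text segment separately so that a highlight tag never crosses an existing tag. The function name highlight_html is made up for this example, and the regex-based tag splitting is an assumption that ignores comments, entities and ">" inside attribute values.

```python
import re

TAG_RE = re.compile(r"(<[^>]+>)")


def highlight_html(html, query, pre="<hi>", post="</hi>"):
    """Wrap matches of query in pre/post tags without crossing existing tags."""
    if not query:
        return html
    # Split into tags and text segments, dropping empty strings.
    pieces = [p for p in TAG_RE.split(html) if p]
    plain = "".join(p for p in pieces if not TAG_RE.fullmatch(p))
    # Match offsets are positions in the tag-free text.
    ranges = [m.span() for m in
              re.finditer(re.escape(query), plain, flags=re.IGNORECASE)]

    out = []
    pos = 0  # offset of the current text segment within plain
    for piece in pieces:
        if TAG_RE.fullmatch(piece):
            out.append(piece)  # existing markup is copied unchanged
            continue
        seg_start, seg_end = pos, pos + len(piece)
        cursor = seg_start
        for m_start, m_end in ranges:
            lo, hi = max(m_start, seg_start), min(m_end, seg_end)
            if lo >= hi:
                continue  # this match does not touch this segment
            out.append(piece[cursor - seg_start:lo - seg_start])
            out.append(pre + piece[lo - seg_start:hi - seg_start] + post)
            cursor = hi
        out.append(piece[cursor - seg_start:])
        pos = seg_end
    return "".join(out)


print(highlight_html("<span>Hello</span>World", "helloworld"))
# prints <span><hi>Hello</hi></span><hi>World</hi>
```

A substring search is only a stand-in here; it will not agree with the ngram and synonym analysis configured in the mapping, so real matches would have to come from whatever matching logic the application trusts.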

Mark_Harwood · August 17, 2018, 12:00pm

It's too hard for the following reasons:

"snippeting" can break markup - highlighting is also about summarising a doc to the best matching sections only. This task is further complicated if the HTML begin/end tags in the original doc don't tie in neatly with the snippeting logic's idea of an interesting section. Tags may need inserting just to open or close the dangling sections deemed uninteresting.
highlighted sections can conflict - a phrase query might match a number of tokens and it's not easy to mark them as a single ... block if there are HTML tags that either begin or end inside that span.
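
To make the second point concrete, a small stack-based check can detect when a highlighted fragment is no longer well nested. This is only a sketch using Python's standard html.parser; the helper name is_well_nested is invented for this example, and it treats every start tag as needing a matching end tag, so void elements such as a bare <br> would be reported as unclosed.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Tracks open tags on a stack and flags out-of-order end tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.well_nested = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # An end tag must close the most recently opened tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.well_nested = False


def is_well_nested(fragment):
    """Return True if every tag is closed in reverse order of opening."""
    checker = NestingChecker()
    checker.feed(fragment)
    checker.close()
    return checker.well_nested and not checker.stack


print(is_well_nested("<span><hi>Hello</span>World</hi>"))           # False
print(is_well_nested("<span><hi>Hello</hi></span><hi>World</hi>"))  # True
```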

7alcon · August 17, 2018, 12:02pm

OK, I see. Thanks for the response!

system · September 14, 2018, 12:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
