Elasticsearch Highlight the result of script fields

AmirMohammad_Safari · October 8, 2022, 8:22am

I wrote an analyzer to remove the HTML tags from my search results. After that, I thought I could highlight the results with an ordinary query, but the highlight field still contains the HTML that the analyzer strips out. Could you please help me highlight the results without the HTML tags that are stored in my DB?
My mapping and settings:

{
  "settings": {
    "analysis": {
      "filter": {
        "my_pattern_replace_filter": {
          "type": "pattern_replace",
          "pattern": "\n",
          "replacement": ""
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"
          ],
          "char_filter": [
            "html_strip"
          ]
        },
        "parsed_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "my_pattern_replace_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "html": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "raw": {
            "type": "text",
            "fielddata": true,
            "analyzer": "parsed_analyzer"
          }
        }
      }
    }
  }
}

Search Query:

POST idx_test/_search

{
  "script_fields": {
    "raw": {
      "script": "doc['html.raw']"
    }
  },
  "query": {
    "match": {
      "html": "more"
    }
  },
  "highlight": {
    "fields": {
      "*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
    }
  }
}

Result:

"hits": [
    {
        "_index": "idx_test2",
        "_type": "_doc",
        "_id": "GijDsYMBjgX3UBaguGxc",
        "_score": 0.2876821,
        "fields": {
            "raw": [
                "Test More test"
            ]
        },
        "highlight": {
            "html": [
                "<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"
            ]
        }
    }
]

Result that I want to get:

"hits": [
    {
        "_index": "idx_test2",
        "_type": "_doc",
        "_id": "GijDsYMBjgX3UBaguGxc",
        "_score": 0.2876821,
        "fields": {
            "raw": [
                "Test <strong>More</strong> test"
            ]
        }
    }
]

RabBit_BR · October 9, 2022, 1:15pm

Hi @AmirMohammad_Safari

I thought of another solution. You could index two fields: the original html, and an html_extract field that holds only the text.
You would need an ingest processor to index just the text extracted from the HTML; highlighting on html_extract would then work.

Mapping

PUT idx_html_strip
{
  "mappings": {
    "properties": {
      "html": {
        "type": "text"
      },
      "html_extract": {
        "type": "text"
      }
    }
  }
}

Processor Pipeline

PUT /_ingest/pipeline/pipe_html_strip
{
  "description": "_description",
  "processors": [
    {
      "html_strip": {
        "field": "html",
        "target_field": "html_extract"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx['html_extract'] = ctx['html_extract'].replace('\n',' ').trim()"
      }
    }
  ]
}

Index Data

Note the use of ?pipeline=pipe_html_strip

POST idx_html_strip/_doc?pipeline=pipe_html_strip
{
  "html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>"""
}

Query

GET idx_html_strip/_search?filter_path=hits.hits._source,hits.hits.highlight
{
  "query": {
    "multi_match": {
      "query": "More",
      "fields": ["html", "html_extract"]
    }
  },
  "highlight": {
    "fields": {
      "*":{ "pre_tags" : ["<strong>"], "post_tags" : ["</strong>"] }
    }
  }
}

Results

{
  "hits": {
    "hits": [
      {
        "_source": {
          "html": """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong>More</strong> test</span></body></html>""",
          "html_extract": "Test More test"
        },
        "highlight": {
          "html": [
            """<html><body><h1 style=\"font-family: Arial\">Test</h1> <span><strong><strong>More</strong></strong> test</span></body>"""
          ],
          "html_extract": [
            "Test <strong>More</strong> test"
          ]
        }
      }
    ]
  }
}
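
To see what the pipeline is meant to put in html_extract, the same steps (strip the tags, replace newlines with spaces, trim) can be sketched outside Elasticsearch. This is only an illustration using Python's standard html.parser; the names _TextExtractor and extract_text are hypothetical, and it is not how Elasticsearch implements html_strip, so edge cases may differ. A helper like this could also prepare html_extract on the client side before indexing if the html_strip processor is not available on a cluster.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes are skipped.
        self.parts.append(data)


def extract_text(html: str) -> str:
    """Strip tags, replace newlines with spaces, and trim the result."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).replace("\n", " ").strip()


# Same sample document as in the thread.
doc = {
    "html": '<html><body><h1 style="font-family: Arial">Test</h1> '
            "<span><strong>More</strong> test</span></body></html>"
}
doc["html_extract"] = extract_text(doc["html"])
print(doc["html_extract"])
```

For the sample document this yields the same plain text that appears in html_extract in the results above, which is the text the highlighter then wraps in <strong> tags.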

AmirMohammad_Safari · October 9, 2022, 2:11pm

Thanks a lot for answering, mate. I tried to reproduce your answer, but I got the following error.
The request I sent:

PUT /_ingest/pipeline/pipe_html_strip

{
  "description": "_description",
  "processors": [
    {
      "html_strip": {
        "field": "html",
        "target_field": "html_extract"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx['html_extract'] = ctx['html_extract'].replace('','').trim()"
      }
    }
  ]
}

Error:

{
    "error": {
        "root_cause": [
            {
                "type": "parse_exception",
                "reason": "No processor type exists with name [html_strip]",
                "processor_type": "html_strip"
            }
        ],
        "type": "parse_exception",
        "reason": "No processor type exists with name [html_strip]",
        "processor_type": "html_strip"
    },
    "status": 400
}

system · November 6, 2022, 2:12pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
