Why do highlight fragments of HTML-stripped fields still contain HTML tags? As far as I can tell from the documentation, I should get highlighted text with the HTML stripped.
Here's what I have.
HTML analyzer
GET /my-index/_settings shows that I have a standard_html analyzer with an html_strip char filter:
{
  "customer-portal": {
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "standard_html": {
              "filter": [
                "lowercase"
              ],
              "char_filter": [
                "html_strip"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        }
      }
    }
  }
}
The analyzer works as expected:
GET /my-index/_analyze
{
  "analyzer": "standard_html",
  "text": "</a>Appointment types</h2>"
}
I get two tokens and no HTML tags:
{
  "tokens": [
    {
      "token": "appointment",
      "start_offset": 4,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "types",
      "start_offset": 16,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
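As a side check, the offsets in this response can be compared against the input string. The minimal Python sketch below uses only the values copied from the request and response above; it suggests that start_offset and end_offset index into the original, unstripped text rather than into the stripped text.

```python
# Input and token offsets copied from the _analyze request/response above.
text = "</a>Appointment types</h2>"
tokens = [
    {"token": "appointment", "start_offset": 4, "end_offset": 15},
    {"token": "types", "start_offset": 16, "end_offset": 21},
]

# Slice the ORIGINAL (unstripped) input with each token's offsets.
slices = [text[t["start_offset"]:t["end_offset"]] for t in tokens]
print(slices)  # ['Appointment', 'types']
```

If the offsets referred to the stripped text, "appointment" would start at 0 instead of 4.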
Indexed HTML field
I have an index that uses the above analyzer to index a content field:
GET /my-index/_mapping
{
  "customer-portal": {
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "standard_html"
        }
      }
    }
  }
}
Single HTML document in index
POST /my-index/_doc
{
  "content": "<p>This is a <strong>superduper</strong> test.</p>"
}
Failed highlighting
As far as I understand, a highlighted fragment for the content field should NOT contain any HTML tags other than the ones defined through the pre_tags/post_tags settings. However, that's not what I see.
GET /my-index/_search
{
  "query": {
    "match": {
      "content": "superduper"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}
The <p> tags in _source.content are expected; the same tags in highlight.content are not.
{
  "hits": {
    "hits": [
      {
        "_source": {
          "content": "<p>This is a <strong>superduper</strong>.</p>"
        },
        "highlight": {
          "content": [
            "<p>This is a <strong><em>superduper</strong></em>.</p>"
          ]
        }
      }
    ]
  }
}
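One possible explanation, stated here as an assumption rather than something the documentation quoted in this question confirms, is that the highlighter wraps matches inside the stored field text, so the char filter never touches the fragment. The Python sketch below is only a toy model of that assumption: wrap_match is a hypothetical helper, not an Elasticsearch API, and the match position is found with a plain string search instead of real analyzer offsets.

```python
def wrap_match(stored_text: str, start: int, end: int,
               pre_tag: str = "<em>", post_tag: str = "</em>") -> str:
    """Toy model: insert highlight tags into the stored text at [start, end)."""
    return (stored_text[:start] + pre_tag + stored_text[start:end]
            + post_tag + stored_text[end:])

# The document indexed earlier, exactly as it is stored in _source.
stored = "<p>This is a <strong>superduper</strong> test.</p>"

# Stand-in for the analyzer offsets of the matched term (assumption).
start = stored.index("superduper")
end = start + len("superduper")

fragment = wrap_match(stored, start, end)
print(fragment)
```

Under this model the fragment keeps the <p> and <strong> tags from the stored text, which is at least consistent with the tags showing up in highlight.content.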
In my real-life index the effect is much worse, because a highlight fragment can contain invalid HTML (missing opening and/or closing tags). Examples:
"""<div class="paragraph">
<p>In order to connect with Microsoft <em>Graph</em> to read/write calendar entries,"""
"""="5"></i><b>5</b></td>
<td>The maximum number of items to return for requests to the Microsoft <em>Graph</em>"""
"""class="fa icon-tip" title="Tip"></i></td>
<td class="content">To analyze issues related to the <em>Graph</em>"""
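To make "invalid HTML" concrete, here is a minimal Python sketch that reports unclosed and unmatched tags in the three fragments above. It uses the standard library's html.parser; the class name TagBalanceChecker and the helper check are made up for this illustration. The check is deliberately naive: it only compares each closing tag with the most recently opened tag and would misreport void elements such as <br>.

```python
from html.parser import HTMLParser

class TagBalanceChecker(HTMLParser):
    """Naive check: tracks tags left open and closing tags with no opener."""
    def __init__(self):
        super().__init__()
        self.unclosed = []          # opened but never closed, in order
        self.unmatched_closes = []  # closed without a matching open tag

    def handle_starttag(self, tag, attrs):
        self.unclosed.append(tag)

    def handle_endtag(self, tag):
        if self.unclosed and self.unclosed[-1] == tag:
            self.unclosed.pop()
        else:
            self.unmatched_closes.append(tag)

def check(fragment: str):
    parser = TagBalanceChecker()
    parser.feed(fragment)
    parser.close()
    return parser.unclosed, parser.unmatched_closes

# Fragments copied from the examples above.
fragments = [
    '<div class="paragraph">\n<p>In order to connect with Microsoft <em>Graph</em> to read/write calendar entries,',
    '="5"></i><b>5</b></td>\n<td>The maximum number of items to return for requests to the Microsoft <em>Graph</em>',
    'class="fa icon-tip" title="Tip"></i></td>\n<td class="content">To analyze issues related to the <em>Graph</em>',
]

results = [check(f) for f in fragments]
for unclosed, unmatched in results:
    print("unclosed:", unclosed, "unmatched closes:", unmatched)
```

Every fragment is reported as unbalanced, so none of them can be inserted into a page as-is.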