My use case is to add the html_strip char_filter to an existing language analyzer.
For example, I would like to create an index like this one:
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_html_strip": {
          "type": "english",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "description_english_html": {
          "type": "text",
          "analyzer": "english_html_strip"
        }
      }
    }
  }
}
I then expect the following _analyze request to both use the english analyzer and strip the HTML tags:
POST /myindex/_analyze
{
  "field": "description_english_html",
  "text": "<h1>A header</h1><p>A paragraph</p>"
}
However, it doesn't strip the HTML tags, as the output shows:
{
  "tokens": [
    {
      "token": "h1",
      "start_offset": 1,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "header",
      "start_offset": 6,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "h1",
      "start_offset": 14,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "p",
      "start_offset": 18,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "paragraph",
      "start_offset": 22,
      "end_offset": 31,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "p",
      "start_offset": 33,
      "end_offset": 34,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
Is there a solution, besides rebuilding the language analyzer from scratch?
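For context, the from-scratch rebuild mentioned above would look roughly like the settings sketch below. It assumes the filter chain that the Elasticsearch language analyzers documentation gives as the custom-analyzer equivalent of the built-in english analyzer (minus the optional keyword_marker filter), with html_strip added as a char_filter. The filter names english_stop, english_stemmer and english_possessive_stemmer are illustrative labels, and the analyzer name is reused from the example above. This is not a confirmed drop-in replacement, and it would be sent as the body of PUT /myindex together with the same mappings.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_html_strip": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
```

Because this is declared as a custom analyzer, the char_filter list is part of its definition, whereas the "type": "english" definition above evidently does not apply it.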