I'm trying to set up a mapping that uses the html_strip character filter in a custom analyzer to strip HTML tags before indexing. Setting up the mapping was relatively easy, but the only way I can get the filter to work when indexing data is to URL-encode the HTML. This just seems wrong; surely I should be able to send an HTML string in the request body of my index request?
These are my index settings:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "html_analyzer": {
            "filter": [
              "standard"
            ],
            "char_filter": [
              "html_strip"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }
    }
  }
}
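For anyone who wants to reproduce this, here is a minimal sketch that builds the same settings body in Python and wraps it in a create-index request. It assumes the settings are applied by a `PUT` to the index URL on `http://localhost:9200` and that the index is `test2`; the helper name `build_create_index_request` is hypothetical. The request is only constructed, not sent.

```python
import json
import urllib.request

# The analysis settings from above, expressed as a Python dict.
INDEX_SETTINGS = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "html_analyzer": {
                        "type": "custom",
                        "char_filter": ["html_strip"],
                        "tokenizer": "standard",
                        "filter": ["standard"],
                    }
                }
            }
        }
    }
}


def build_create_index_request(base_url, index_name):
    """Build (but do not send) a PUT request that would create the index.

    Hypothetical helper: base_url and index_name are assumptions.
    """
    body = json.dumps(INDEX_SETTINGS).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_create_index_request("http://localhost:9200", "test2")
# urllib.request.urlopen(request) would send it to a running node.
```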
And this is my type mapping:
{
  "test_type": {
    "mappings": {
      "test_type": {
        "dynamic": "strict",
        "properties": {
          "createtime": {
            "type": "long"
          },
          "updatetime": {
            "type": "long"
          },
          "title": {
            "type": "string"
          },
          "description": {
            "type": "string"
          },
          "label": {
            "type": "string"
          },
          "notes": {
            "properties": {
              "author": {
                "properties": {
                  "firstname": {
                    "type": "string"
                  },
                  "secondname": {
                    "type": "string"
                  }
                }
              },
              "category": {
                "type": "string"
              },
              "createdate": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "detail": {
                "type": "string",
                "analyzer": "html_analyzer"
              },
              "type": {
                "type": "string"
              }
            }
          }
        }
      }
    }
  }
}
If I test the analyzer with the analyze endpoint, it works:
GET http://localhost:9200/test2/_analyze?analyzer=html_analyzer&text=this+<b>is</b>+a+note
{
  "tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "is",
      "start_offset": 8,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "a",
      "start_offset": 15,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "note",
      "start_offset": 17,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 4
    }
  ]
}
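To make the expected behaviour concrete, here is a rough Python approximation of the analysis chain: strip tags first, then split into word tokens. This is not how Elasticsearch implements html_strip or the standard tokenizer; both regular expressions are crude stand-ins, and the function name is hypothetical. It only illustrates that a "b" token should appear when the tags are not stripped and disappear when they are.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")  # crude stand-in for the html_strip char filter
WORD_PATTERN = re.compile(r"\w+")     # crude stand-in for the standard tokenizer


def approximate_html_analyzer(text):
    """Strip tags (char filter step), then split into word tokens."""
    stripped = TAG_PATTERN.sub(" ", text)
    return WORD_PATTERN.findall(stripped)


text = "this <b>is</b> a note"

# With tag stripping: no "b" token is produced.
stripped_tokens = approximate_html_analyzer(text)
print(stripped_tokens)  # ['this', 'is', 'a', 'note']

# Without tag stripping: the tag name becomes a searchable token.
raw_tokens = WORD_PATTERN.findall(text)
print(raw_tokens)  # ['this', 'b', 'is', 'b', 'a', 'note']
```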
However, if I then index the same string in a document:
PUT http://localhost:9200/test2/test_type/1
{"title": "title1", "_notes": [{"type": "test", "detail": "this <b>is</b> a note"}], "_updatetime": 1438007564786}
and search for 'b':
GET http://localhost:9200/test2/_search?q=b
then I get a hit for the document, which suggests the tags were not stripped.
The only way I've managed to get this to work is to send the HTML string URL-encoded, i.e.:
PUT http://localhost:9200/test2/test_type/1
{"title": "title1", "_notes": [{"type": "test", "detail": "This %3Cb%3Eis%3C%5Cb%3E a note"}], "_updatetime": 1438007564800}
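As a small sketch of the encoding involved, assuming Python's standard `urllib.parse` and keeping spaces unescaped to match the request above: percent-encoding the HTML string turns `<`, `>` and `/` into `%3C`, `%3E` and `%2F`. Note that the string in the request above contains `%5C`, which decodes to a backslash rather than a forward slash.

```python
from urllib.parse import quote, unquote

html = "This <b>is</b> a note"

# Percent-encode everything except unreserved characters and spaces.
encoded = quote(html, safe=" ")
print(encoded)  # This %3Cb%3Eis%3C%2Fb%3E a note

# Decoding restores the original HTML.
assert unquote(encoded) == html

# The string from the request above decodes to a backslash in the closing tag.
posted = "This %3Cb%3Eis%3C%5Cb%3E a note"
decoded_posted = unquote(posted)
print(decoded_posted)  # This <b>is<\b> a note
```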