Hi,
Here are some scripts to reproduce what we are experiencing.
Index creation with 2 custom analyzers:
- One used to index all fields -> IndexAnalyzer
- One used to search -> MyStandardSearchAnalyzer
Both analyzers use the "html_strip" char_filter.
PUT /test-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "MyStandardSearchAnalyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "asciifolding"],
          "tokenizer": "standard"
        },
        "IndexAnalyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "asciifolding"],
          "tokenizer": "my_ngramTokenizer"
        }
      },
      "tokenizer": {
        "my_ngramTokenizer": {
          "min_gram": 3,
          "max_gram": 36,
          "type": "ngram"
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "dynamic_templates": [
        {
          "defaultTemplate": {
            "match": "*",
            "mapping": {
              "type": "text",
              "fields": {
                "raw": {
                  "type": "keyword"
                }
              },
              "analyzer": "IndexAnalyzer",
              "search_analyzer": "MyStandardSearchAnalyzer"
            }
          }
        }
      ]
    }
  }
}
Add one document that contains unescaped HTML:
PUT /test-index/mytype/1
{
  "field1": "<img src='x' onerror='alert('dangerous attack')'/>",
  "field2": "<span>text inside html tag and some special chars èéà</span>"
}
Search this document:
POST /test-index/mytype/_search?typed_keys=true
{
  "from": 0,
  "size": 100
}
The search result contains the HTML tags:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index",
        "_type": "mytype",
        "_id": "1",
        "_score": 1,
        "_source": {
          "field1": "<img src='x' onerror='alert('dangerous attack')'/>",
          "field2": "<span>text inside html tag and some special chars èéà</span>"
        }
      }
    ]
  }
}
I understand that this is the normal behaviour of Elasticsearch, and it's not a problem if you don't use the highlighting feature.
Is there a built-in way in Elasticsearch to remove the HTML tags from this response?
We tried escaping the HTML tags before indexing to make the "alert('dangerous attack')" harmless, but the problem is that the special chars (èéà) are then stored escaped as well, and at that point the asciifolding search no longer works (which is expected too).
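To illustrate the escaping issue described above, here is a minimal Python sketch. It is only an illustration: the function names `escape_all` and `escape_markup_only` are hypothetical, and it assumes our escaping step converts accented characters to entities, which may not match every escaper. It compares that behaviour with escaping only the markup characters, which leaves "èéà" untouched in the stored value.

```python
import html


def escape_all(text: str) -> str:
    """Escape markup characters and turn every non-ASCII character
    into a numeric entity (assumed to resemble what our escaper does)."""
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def escape_markup_only(text: str) -> str:
    """Escape only &, <, >, double and single quotes; accented characters are kept as-is."""
    return html.escape(text, quote=True)


field2 = "<span>text inside html tag and some special chars èéà</span>"

# The accented characters become entities, so they are stored escaped.
print(escape_all(field2))

# The tags are neutralised, but "èéà" is stored unchanged.
print(escape_markup_only(field2))
```

This does not remove the tags from the response; it only shows that the accented characters do not have to be escaped together with the markup.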
I hope this is clear now; if not, don't hesitate to ask questions!
Best regards,
Valentin