Escaping HTML in the Elasticsearch response

Hi everyone,

To enable highlighting in Elasticsearch search results while avoiding JavaScript injection, we decided to escape special characters before indexing. But after doing this, we noticed that the ASCII-folding feature no longer works. So finding the word "été" by typing "ete" is not possible, as the indexed word is "été".

Is there a way to avoid this behaviour? Can Elasticsearch escape the search results while the documents are indexed in their 'raw' form?

Note: we tried to use the "html_strip" char filter, both at index time and at search time, but the returned document still contains the HTML tags.

Best Regards,
Valentin

Why this? Could you explain what you did exactly?

Hello David,
Thanks for your fast reply.

As I tried to explain with my example, the fact that we index the word 'été' as "& eacute;t& eacute;"* makes the ASCII-folding search impossible.

I hope it is clearer now.

I can post my mapping if you want, but the problem is not there: the ASCII-folding search works well if we do not escape special chars.

*I added some spaces to make it visible; it was rendered as the actual characters in my first post, my bad.
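To make the effect concrete, here is a small sketch (in Python, which is not what we use; note that Python's stdlib produces numeric references like `&#233;` rather than the named `&eacute;` our escaper emits, but the effect on the index is the same):

```python
# Sketch: why entity-escaping before indexing defeats asciifolding.
escaped = "été".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # &#233;t&#233;

# asciifolding maps 'é' -> 'e', so the raw word would be indexed as "ete".
# But the escaped string contains no 'é' at all, so the filter has nothing
# to fold and a search for "ete" cannot match.
print("é" in escaped)  # False
```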

lol... Next time you can use the </> button or Markdown formatting to properly format your code. So &eacute;t&eacute;.

Could you provide a full recreation script, as described in About the Elasticsearch category? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers understand, reproduce and, if needed, fix your problem. It will also most likely get you a faster answer.

Specifically your tests around the html_strip char filter.

Hi,
Here are some scripts to reproduce what we are experiencing.

Index creation with 2 custom analyzers:

  • One used to index all fields -> IndexAnalyzer
  • One used to search -> MyStandardSearchAnalyzer

Both analyzers use the "html_strip" char_filter.

PUT /test-index
{
	"settings": {
		"analysis": {
			"analyzer": {
				"MyStandardSearchAnalyzer": {
					"type": "custom",
					"char_filter": ["html_strip"],
					"filter": ["lowercase", "asciifolding"],
					"tokenizer": "standard"
				},
				"IndexAnalyzer": {
					"type": "custom",
					"char_filter": ["html_strip"],
					"filter": ["lowercase", "asciifolding"],
					"tokenizer": "my_ngramTokenizer"
				}
			},
			"tokenizer": {
				"my_ngramTokenizer": {
					"min_gram": 3,
					"max_gram": 36,
					"type": "ngram"
				}
			}
		}
	},
	"mappings": {
		"mytype": {
			"dynamic_templates": [
				{
					"defaultTemplate": {
						"match": "*",
						"mapping": {
							"type": "text",
							"fields": {
								"raw": {
									"type": "keyword"
								}
							},
							"analyzer": "IndexAnalyzer",
							"search_analyzer": "MyStandardSearchAnalyzer"
						}
					}
				}
			]
		}
	}
}
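As a side note, the html_strip + asciifolding chain itself can be checked with the _analyze API (this assumes the index above has been created):

```
GET /test-index/_analyze
{
	"analyzer": "MyStandardSearchAnalyzer",
	"text": "<span>été</span>"
}
```

This should return a single token "ete", which shows the analysis chain handles raw (unescaped) HTML correctly; the problem described below is only about what comes back in _source.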

Add one document which contains unescaped HTML:

PUT /test-index/mytype/1
{
	"field1": "<img src='x' onerror='alert('dangerous attack')'/>",
	"field2": "<span>text inside html tag and some special chars èéà</span>"
}

Search this document:

POST /test-index/mytype/_search?typed_keys=true
{
	"from": 0,
	"size": 100
}

The search result contains the HTML tags:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test-index",
        "_type": "mytype",
        "_id": "1",
        "_score": 1,
        "_source": {
          "field1": "<img src='x' onerror='alert('dangerous attack')'/>",
          "field2": "<span>text inside html tag and some special chars èéà</span>"
        }
      }
    ]
  }
} 

I understand that this is the normal behaviour of Elasticsearch, and it's not a problem if you don't use the highlighting feature.

Is there a built-in way in Elasticsearch to remove HTML tags from this response?

We tried to escape HTML tags to make the "alert('dangerous attack')" harmless, but the problem is that the special chars are then stored escaped, and that's where the asciifolding search stops working (which is normal too).

I hope this is clear now; if not, don't hesitate to ask questions!

Best regards,
Valentin

I'm posting to prevent this topic from closing automatically.

The only way to do that is to alter the _source before it gets indexed.
So either do that in your application before sending the document to Elasticsearch, or use an ingest pipeline to do it. I'm not sure which processor you can use, though. Maybe a script processor with your own Painless script.
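A minimal sketch of that idea, assuming a script processor that escapes '<' in the two fields of the example document (the pipeline name escape-html is made up, and this relies on String.replace being available in Painless):

```
PUT _ingest/pipeline/escape-html
{
	"description": "Escape '<' so the stored _source is never interpreted as HTML",
	"processors": [
		{
			"script": {
				"source": "if (ctx.field1 != null) { ctx.field1 = ctx.field1.replace('<', '&lt;') } if (ctx.field2 != null) { ctx.field2 = ctx.field2.replace('<', '&lt;') }"
			}
		}
	]
}

PUT /test-index/mytype/2?pipeline=escape-html
{
	"field1": "<img src='x'/>",
	"field2": "été"
}
```

A value like "été" passes through untouched, so asciifolding keeps working; only the tag-opening character is rewritten before the document is stored.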

Running out of ideas. :frowning_face:

Ok, thanks.

We implemented a workaround where only the "<" character is HTML-encoded; this way nothing is interpreted as HTML anymore and we don't lose any ES functionality.
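For readers landing here, a minimal sketch of that workaround in Python (the field names are just illustrative):

```python
def escape_lt(value: str) -> str:
    """Encode only '<' so the string can never open an HTML tag,
    while accented characters stay raw for asciifolding."""
    return value.replace("<", "&lt;")

doc = {
    "field1": "<img src='x' onerror='alert(1)'/>",
    "field2": "text with special chars èéà",
}
# Escape every string field before sending the document to Elasticsearch.
safe = {k: escape_lt(v) for k, v in doc.items()}
print(safe["field1"])  # &lt;img src='x' onerror='alert(1)'/>
print(safe["field2"])  # text with special chars èéà
```

Since a tag cannot open without "<", encoding that single character is enough to neutralise markup, and accented characters reach the analyzer untouched.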

Best regards,
Valentin