[7.10.2] Querying for "fields" produces HTML entities despite use of "html_strip"

We are currently indexing using this analyzer for our text fields:

      lowercase_keyword: {
        type: 'custom',
        char_filter: ['html_strip'],
        tokenizer: 'keyword',
        filter: ['asciifolding', 'lowercase'],
      },

Example mapping snippet:

      lastName: {
        type: 'text',
        analyzer: 'lowercase_keyword',
        fields: {
          raw: {
            type: 'keyword',
          },
        },
      },

Using the /_analyze endpoint to check that analyzer with a test string (<p>&quot;The quick brown fox jumps over <strong>the lazy dog.</strong>&quot;</p>), I can see that the expected tag stripping and entity decoding take place:

{
	"tokens": [
		{
			"token": "\n\"the quick brown fox jumps over the lazy dog.\"\n",
			"start_offset": 0,
			"end_offset": 80,
			"type": "word",
			"position": 0
		}
	]
}
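
For anyone who wants to repeat this check, here is a minimal sketch of the two request shapes that could produce output like the above. The index name `people` is an assumption (it does not appear in the original post); the analyzer components are copied from the settings shown earlier. The helpers only build the request, they do not send it.

```javascript
// Hypothetical index name; replace with the index that defines lowercase_keyword.
const ANALYZE_INDEX = 'people';

// Build a GET /<index>/_analyze request that runs the named analyzer on `text`.
function buildAnalyzeRequest(text) {
  return {
    method: 'GET',
    path: `/${ANALYZE_INDEX}/_analyze`,
    body: { analyzer: 'lowercase_keyword', text },
  };
}

// Same check without an index: spell out the analyzer's components inline.
function buildAdHocAnalyzeRequest(text) {
  return {
    method: 'GET',
    path: '/_analyze',
    body: {
      char_filter: ['html_strip'],
      tokenizer: 'keyword',
      filter: ['asciifolding', 'lowercase'],
      text,
    },
  };
}

const sampleText =
  '<p>&quot;The quick brown fox jumps over <strong>the lazy dog.</strong>&quot;</p>';

// Print both requests so they can be pasted into a console or HTTP client.
for (const req of [buildAnalyzeRequest(sampleText), buildAdHocAnalyzeRequest(sampleText)]) {
  console.log(`${req.method} ${req.path}\n${JSON.stringify(req.body, null, 2)}\n`);
}
```

The first form confirms that the analyzer is actually registered on the index; the second only confirms that the component chain behaves as expected.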

I know that _source returns the document exactly as it was indexed, so I fully expect to see D&quot;Angelo for lastName there. However, my understanding from reading the documentation was that querying with fields would use the mapping and return a value more in line with what the analyzer example shows. Instead, fields also returns the HTML entities.

We are hoping to avoid scrubbing the data before indexing and would appreciate any guidance. Please let me know if I can supplement my examples to better clarify the problem.

Are you sure that the mapping is properly applied in your index? Can you provide a fully reproducible example, including index creation, mapping, and document indexing?
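
A reproduction along these lines might look like the sketch below. It reuses the analyzer and mapping from the question; the index name `html-strip-repro`, the document ID, and the `match_all` query are assumptions made for illustration. The script only prints the requests in order, so they can be replayed against a test cluster.

```javascript
// Hypothetical index name used only for this reproduction.
const REPRO_INDEX = 'html-strip-repro';

// Return the ordered list of requests needed to reproduce the behaviour.
function buildReproSteps(index) {
  return [
    {
      // 1. Create the index with the analyzer and mapping from the question.
      method: 'PUT',
      path: `/${index}`,
      body: {
        settings: {
          analysis: {
            analyzer: {
              lowercase_keyword: {
                type: 'custom',
                char_filter: ['html_strip'],
                tokenizer: 'keyword',
                filter: ['asciifolding', 'lowercase'],
              },
            },
          },
        },
        mappings: {
          properties: {
            lastName: {
              type: 'text',
              analyzer: 'lowercase_keyword',
              fields: { raw: { type: 'keyword' } },
            },
          },
        },
      },
    },
    // 2. Confirm that the mapping was applied as intended.
    { method: 'GET', path: `/${index}/_mapping` },
    {
      // 3. Index one document containing an HTML entity.
      method: 'PUT',
      path: `/${index}/_doc/1?refresh=true`,
      body: { lastName: 'D&quot;Angelo' },
    },
    {
      // 4. Search, asking for both _source and the fields option.
      method: 'POST',
      path: `/${index}/_search`,
      body: {
        query: { match_all: {} },
        _source: true,
        fields: ['lastName', 'lastName.raw'],
      },
    },
  ];
}

for (const step of buildReproSteps(REPRO_INDEX)) {
  const body = step.body ? `\n${JSON.stringify(step.body, null, 2)}` : '';
  console.log(`${step.method} ${step.path}${body}\n`);
}
```

Comparing the `_source` and `fields` sections of the last response side by side should show whether the two differ for this document.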

Also, there is an ingest processor that strips HTML from the source itself; see the HTML strip processor page in the Elasticsearch Guide [7.13].
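
As a rough sketch of that approach, the requests below define an ingest pipeline with a single `html_strip` processor and then simulate it on a sample document. The pipeline ID `strip-html-last-name` is an assumption; the field name and sample value come from the question. Whether the processor also decodes entities such as `&quot;` is worth confirming from the simulate output on your version before relying on it.

```javascript
// Hypothetical pipeline ID; any valid ID works.
const PIPELINE_ID = 'strip-html-last-name';

// Build the request that creates (or replaces) the pipeline.
function buildPipelineRequest(field) {
  return {
    method: 'PUT',
    path: `/_ingest/pipeline/${PIPELINE_ID}`,
    body: {
      description: 'Strip HTML from a field before it is stored',
      processors: [{ html_strip: { field } }],
    },
  };
}

// Build a simulate request that runs the stored pipeline on one sample document.
function buildSimulateRequest(field, value) {
  return {
    method: 'POST',
    path: `/_ingest/pipeline/${PIPELINE_ID}/_simulate`,
    body: { docs: [{ _source: { [field]: value } }] },
  };
}

const pipelineRequests = [
  buildPipelineRequest('lastName'),
  buildSimulateRequest('lastName', 'D&quot;Angelo'),
];

for (const req of pipelineRequests) {
  console.log(`${req.method} ${req.path}\n${JSON.stringify(req.body, null, 2)}\n`);
}
```

Once the pipeline exists, documents can be routed through it by adding `?pipeline=strip-html-last-name` to the index request. Unlike the analyzer, the processor rewrites the field before the document is stored, so the stored document itself changes.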
