We are currently indexing using this analyzer for our text fields:
lowercase_keyword: {
  type: 'custom',
  char_filter: ['html_strip'],
  tokenizer: 'keyword',
  filter: ['asciifolding', 'lowercase'],
},
Example mapping snippet:
lastName: {
  type: 'text',
  analyzer: 'lowercase_keyword',
  fields: {
    raw: {
      type: 'keyword',
    },
  },
},
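In case it helps, here is roughly how those two pieces fit together in a single index-creation request (the index name my_index is just a placeholder):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["asciifolding", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "lastName": {
        "type": "text",
        "analyzer": "lowercase_keyword",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}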
Using the /_analyze endpoint to check the analyzer composition with a test string (<p>"The quick brown fox jumps over <strong>the lazy dog.</strong>"</p>), I can see that the expected decoding is taking place:
{
  "tokens": [
    {
      "token": "\n\"the quick brown fox jumps over the lazy dog.\"\n",
      "start_offset": 0,
      "end_offset": 80,
      "type": "word",
      "position": 0
    }
  ]
}
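For reference, the output above came from a request along these lines (again with a placeholder index name):

GET /my_index/_analyze
{
  "analyzer": "lowercase_keyword",
  "text": "<p>\"The quick brown fox jumps over <strong>the lazy dog.</strong>\"</p>"
}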
I know that using _source will return the indexed documents as they were sent, where I would fully expect to get D"Angelo for lastName in my index. But when I query using fields, my understanding from reading the documentation was that the value would go through the mapping and come back more in line with what the analyzer example above shows.
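Concretely, a query along these lines (index name again a placeholder) still returns the original D"Angelo for lastName rather than the analyzed value:

GET /my_index/_search
{
  "query": { "match_all": {} },
  "fields": ["lastName"],
  "_source": false
}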
I'm hoping to avoid any pre-indexing scrubbing of the data and would appreciate any guidance. Please let me know if I can supplement my examples to help clarify the problem.