We are currently indexing using this analyzer for our text fields:
lowercase_keyword: {
  type: 'custom',
  char_filter: ['html_strip'],
  tokenizer: 'keyword',
  filter: ['asciifolding', 'lowercase'],
},
Example mapping snippet:
lastName: {
  type: 'text',
  analyzer: 'lowercase_keyword',
  fields: {
    raw: {
      type: 'keyword',
    },
  },
},
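In case it helps, here is roughly how those two pieces fit together in a single index-creation request (the index name my_index is just a placeholder):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "keyword",
          "filter": ["asciifolding", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "lastName": {
        "type": "text",
        "analyzer": "lowercase_keyword",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}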
Using the /_analyze endpoint to check the analyzer composition with a test string (<p>"The quick brown fox jumps over <strong>the lazy dog.</strong>"</p>), I can see that the expected decoding is taking place:
{
  "tokens": [
    {
      "token": "\n\"the quick brown fox jumps over the lazy dog.\"\n",
      "start_offset": 0,
      "end_offset": 80,
      "type": "word",
      "position": 0
    }
  ]
}
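For reference, the output above came from a request along these lines (again with a placeholder index name):

GET /my_index/_analyze
{
  "analyzer": "lowercase_keyword",
  "text": "<p>\"The quick brown fox jumps over <strong>the lazy dog.</strong>\"</p>"
}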
I know that using _source will return the indexed documents as they were sent, where I would fully expect to get D"Angelo for lastName in my index. But when I query using fields, my understanding from reading the documentation was that the value would go through the mapping and come back more in line with what the analyzer example above shows.
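Concretely, a query along these lines (index name again a placeholder) still returns the original D"Angelo for lastName rather than the analyzed value:

GET /my_index/_search
{
  "query": { "match_all": {} },
  "fields": ["lastName"],
  "_source": false
}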
I'm hoping to avoid any pre-indexing scrubbing of the data and would appreciate any guidance. Please let me know if I can supplement my examples to help clarify the problem.