Is the KEYWORD data type analyzed as well?


(Kasia) #1

Hi,
I thought inverted indexes were created for all fields, regardless of their data types, and that only TEXT fields were analyzed before indexing. But now, after reading the documentation (Reference 5.1) about the truncate token filter, I'm pretty confused.

This is what Reference 5.1 says:

The truncate token filter can be used to truncate tokens into a specific length.
This can come in handy with keyword (single token) based mapped fields that are used for sorting in order to reduce memory usage.
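
To make the quoted passage concrete, here is a minimal sketch in plain Python of what truncating tokens amounts to. It is only a simulation, not Elasticsearch code: the function name `truncate_filter` and the sample value are made up for illustration, and the length of 10 is passed explicitly rather than relying on any default.

```python
def truncate_filter(tokens, length):
    """Hypothetical stand-in for a truncate token filter:
    cut every token down to at most `length` characters."""
    return [token[:length] for token in tokens]


# A keyword-style (single token) field yields one token: the whole value.
single_token = ["a-very-long-product-identifier-0001"]

# After truncation the token is shorter, so less text is kept per value.
print(truncate_filter(single_token, length=10))
```

Note that two different values sharing the same first 10 characters would become indistinguishable after truncation, which is why this only seems reasonable where coarse ordering is enough.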

The truncate token filter, as any token filter, is supposed to be part of an analyzer, right?
Does it mean that keyword mapped fields are being analyzed as well?
If so, where do I set an analyzer? The keyword mapping only allows search_analyzer to be set...

Could you please explain what the internals look like?
Is an inverted index really created for every field marked as indexed (index=true) in the mapping?
What type do the inverted index entries have when the indexed field is a DATE or any numeric field?

Thanks in advance,
Kasia


(Colin Goodheart-Smithe) #2

Keyword fields in 5.0 are effectively analyzed using the keyword analyzer, which takes the value of the field and creates a single token for the index whose text is the value of the field. At the moment, this behaviour cannot be changed, but we do have an issue for adding the ability to set a "normaliser" on keyword fields to allow some customisation of how the token is created (such as lowercasing the value). The limitation is that the "normaliser" should always result in a single token for a given field value. The issue is here: https://github.com/elastic/elasticsearch/pull/21919

That part of the documentation is admittedly confusing, since it mixes up the keyword analyzer (which can be used on text fields) with keyword field types, on which you can't specify an analyzer. I have opened https://github.com/elastic/elasticsearch/issues/22650 to correct the wording here.
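
As a rough illustration of the difference, the following plain-Python sketch simulates the three behaviours discussed here. None of it is Elasticsearch code: the function names are hypothetical, the "text" analysis is a deliberately simplified stand-in (lowercase, then split on whitespace), and the "normaliser" is assumed to do nothing more than lowercase the value.

```python
def keyword_analyze(value):
    """Keyword-style analysis: the whole value becomes one token, unchanged."""
    return [value]


def normalized_keyword_analyze(value):
    """Keyword-style analysis with an assumed lowercasing 'normaliser'.
    Still exactly one token per field value."""
    return [value.lower()]


def simple_text_analyze(value):
    """Simplified stand-in for text analysis: lowercase and split on
    whitespace, producing one token per word."""
    return value.lower().split()


value = "Quick Brown Fox"
print(keyword_analyze(value))             # one token, original casing
print(normalized_keyword_analyze(value))  # one token, lowercased
print(simple_text_analyze(value))         # several tokens
```

The point of the sketch is the token count: both keyword variants always return a single token, which is the constraint a "normaliser" has to respect.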


(Kasia) #3

Hi Colin,
Thanks for your explanation.
If I understood correctly, you're using the keyword analyzer for keyword fields internally, but the keyword analyzer itself cannot be configured, right? To add a normalizer, I would need to configure a custom analyzer using a keyword tokenizer and some normalizing token filters, and in that case an "analyzer" parameter would have to be enabled for keyword fields. Is this what you are working on?

Nonetheless, I still don't get the example of when the truncate token filter could be useful.
The documentation mentions sorting, but shouldn't I rather use doc_values if I were to sort on a keyword field? Or do you mean that if I insisted, for some reason, on sorting on a text field, truncating tokens would be helpful (but harmful, I suppose, for search)?

Best regards,
Kasia

