Keyword datatype and Identifiers


In the "Elastic Engineer" certification course, there is a puzzling slide in Module 5, Lesson 3. Slide number 314 advises using keyword for identifier fields instead of numeric types. A Lucene issue relates to that point: LUCENE-10449, "Performance regression due to LZ4 compression of TermsDict in SortedSetDocValues" (ASF JIRA).

In that issue, keyword fields are shown to perform poorly with identifier-like data, especially when that field is scanned through doc values. The reason is that keyword uses a sorted-set doc values structure, which compresses the data. Since identifiers are essentially random, there is not much to compress, hence the performance impact.

I think this kind of data access can happen frequently, e.g., whenever there is an aggregation on that identifier field. Can you shed some light on the advice in the slide, in the context of the issue I have commented on?
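To make the access pattern concrete, here is a minimal sketch of a terms aggregation request body, built as a plain Python dict. The field name `order_id` and the aggregation name `by_id` are hypothetical and only serve as illustration; a request of this shape reads the field's doc values rather than the inverted index.

```python
import json


def terms_aggregation(field, bucket_count=10):
    """Build a search body that buckets documents by the values of `field`."""
    return {
        "size": 0,  # return no hits, only the aggregation result
        "aggs": {
            # "by_id" is a hypothetical aggregation name
            "by_id": {"terms": {"field": field, "size": bucket_count}}
        },
    }


# "order_id" is a hypothetical identifier field mapped as keyword
body = terms_aggregation("order_id")
print(json.dumps(body, indent=2))
```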


Hello @yfful

Thanks for bringing up the topic and sharing the information about the performance regression. While I can't answer your question on a technical/implementation level, I can answer it from the perspective of training delivery, and explain the motivation behind that comment.

When dealing with numeric identifiers, users are tempted to map them as numbers. With numbers, it's possible to use specialized queries (for instance, range queries). To support that, Elasticsearch needs to maintain complex data structures (BKD trees). If you only need exact match queries, you are far better off mapping those identifier fields as keyword fields. Keywords are not stored as BKD trees, and a "simple" inverted index is sufficient. This is the motivation behind the statement on the slide you mentioned: to avoid the "overhead" of BKD trees if it's not needed.
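As a rough illustration of the two options, the sketch below builds both mappings and an exact-match query as plain Python dicts that could be serialized to JSON and sent to Elasticsearch. The field name `order_id` is hypothetical, and this is only one way such a mapping might look, not the exact content of the slide.

```python
import json

# Hypothetical identifier field "order_id", mapped two different ways.
# Numeric mapping: supports range queries, backed by a BKD tree.
numeric_mapping = {"mappings": {"properties": {"order_id": {"type": "long"}}}}

# Keyword mapping: exact-match lookups through the inverted index.
keyword_mapping = {"mappings": {"properties": {"order_id": {"type": "keyword"}}}}


def term_query(field, value):
    """Build an exact-match (term) query; the value is sent as a string."""
    return {"query": {"term": {field: {"value": str(value)}}}}


print(json.dumps(keyword_mapping, indent=2))
print(json.dumps(term_query("order_id", 314159), indent=2))
```

With the keyword mapping, the identifier is matched as an opaque string, so range queries over it would compare values lexicographically rather than numerically.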


Thanks @dschneiter. Suppose I index a field using the integer datatype, for example, and disable the index by setting index=false in the mapping. Would I then achieve the same end goal as the slide's comment, i.e., avoid the BKD trees? Would it be better or worse than using keyword, performance-wise?
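For reference, the mapping described here might look like the following sketch, again as a plain Python dict. The field name `order_id` is hypothetical; `doc_values` is left at its default, which for numeric fields is enabled.

```python
import json

# Hypothetical identifier field "order_id": integer type with indexing disabled.
# doc_values is not set, so it keeps its default (enabled for numeric fields),
# which is what aggregations and sorting read from.
unindexed_integer_mapping = {
    "mappings": {
        "properties": {
            "order_id": {"type": "integer", "index": False}
        }
    }
}

# Python's False serializes to JSON false, as the mapping API expects.
print(json.dumps(unindexed_integer_mapping, indent=2))
```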

Hi @yfful, your suggestion goes in a similar direction. I am not a developer and not too familiar with low-level implementation details. I would definitely expect storage requirements to go down, but at the same time, you would no longer be able to query the field efficiently. The whole idea of storing numeric identifiers as keywords stems from the fact that you would still be able to execute all "reasonable" query, aggregation, and sorting requests without the overhead introduced by BKD trees.

I'm wondering why you would want to store
