Keyword datatype and Identifiers

yfful · May 15, 2022, 9:29pm

Hello,

In the "Elastic Engineer" certification course, there is a slide in Module 5 Lesson 3 that puzzles me. It is slide number 314, and it advises using keyword for identifier fields instead of numeric types. An issue was opened in Lucene, [LUCENE-10449] Performance regression due to LZ4 compression of TermsDict in SortedSetDocValues - ASF JIRA, that relates to that point.

In that issue, keyword fields are shown to perform badly with identifier-like data, especially when scanning that field through docvalues. The reason is that keyword uses a sorted-set docvalues structure, which compresses the data. Since identifiers are effectively random, there is not much to compress, hence the performance impact.

I think this kind of data access can happen frequently, e.g., whenever there is an aggregation on that identifier field. Can you shed some light on the advice in the slide, in the context of the issue I have commented on?

Cheers,

dschneiter · May 16, 2022, 12:24pm

Hello @yfful

Thanks for bringing up the topic and sharing the information about the performance regression. While I can't answer your question on a technical/implementation level, I can answer it from the perspective of training delivery and explain the motivation behind that comment.

When dealing with numeric identifiers, users are tempted to map them as numbers. With numbers, it's possible to use specialized queries (for instance, range queries). To support that, Elasticsearch needs to maintain complex data structures (BKD trees). If you only need exact match queries, you are far better off mapping those identifier fields as keyword fields. Keywords are not stored as BKD trees, and a "simple" inverted index is sufficient. This is the motivation behind the statement on the slide you mentioned: to avoid the "overhead" of BKD trees when it's not needed.
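To make the mapping choice concrete, here is a minimal sketch that builds the request bodies as plain Python dictionaries. The field names `order_id` and `quantity` are hypothetical and only serve as illustration; the bodies would be sent to Elasticsearch with whatever client you use.

```python
import json


def identifier_mappings():
    """Index-creation body: the identifier is a keyword, the real number stays numeric."""
    return {
        "mappings": {
            "properties": {
                # Hypothetical identifier field: only exact matches are needed.
                "order_id": {"type": "keyword"},
                # Hypothetical numeric field: range queries make sense here.
                "quantity": {"type": "integer"},
            }
        }
    }


def exact_match_query(field, value):
    """Term query body; the value is sent as a string because the field is a keyword."""
    return {"query": {"term": {field: {"value": str(value)}}}}


if __name__ == "__main__":
    print(json.dumps(identifier_mappings(), indent=2))
    print(json.dumps(exact_match_query("order_id", 314), indent=2))
```

The term query is the "exact match" case the slide has in mind; a range query on `order_id` would rarely be meaningful, which is why the numeric structures are not needed for it.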

yfful · May 16, 2022, 3:01pm

Thanks @dschneiter. If I were to index a field using the integer datatype, for example, and disable the index by setting index=false in the mapping, would I then achieve the same end goal as the slide's comment, i.e., avoid the kd-trees? Would it be better or worse than using keyword, performance-wise?
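For reference, the mapping described in this question could look like the sketch below. The field name `customer_id` is hypothetical. In the JSON body the parameter is written `"index": false`; `doc_values` is left at its default, so the field should remain usable for aggregations and sorting. Whether, and how fast, it can still be queried depends on the Elasticsearch version, so treat that part as something to verify against the documentation for your release.

```python
import json


def unindexed_integer_mapping(field):
    """Index-creation body for an integer field that is not indexed for search."""
    return {
        "mappings": {
            "properties": {
                # "index": False serializes to JSON false; doc_values is not
                # set here, so the server-side default applies.
                field: {"type": "integer", "index": False},
            }
        }
    }


if __name__ == "__main__":
    # customer_id is a hypothetical field name.
    print(json.dumps(unindexed_integer_mapping("customer_id"), indent=2))
```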

dschneiter · May 23, 2022, 2:06pm

Hi @yfful, your suggestion goes in a similar direction. I am not a developer and not too familiar with low-level implementation details. I would definitely expect storage requirements to go down, but at the same time, you would no longer be able to query those fields efficiently. The whole idea of storing numeric identifiers as keywords stems from the fact that you would still be able to execute all "reasonable" query, aggregation, and sorting requests without the overhead introduced by BKD trees.
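As a rough illustration of the aggregation and sorting requests mentioned above, the sketch below builds a search body against a keyword identifier field. The field name `order_id` and the aggregation name `by_id` are hypothetical. One caveat worth keeping in mind is that keyword values sort as strings, not as numbers, which the small helper at the end demonstrates in plain Python.

```python
import json


def aggregate_and_sort(field, size=10):
    """Search body with a terms aggregation and a sort on a keyword field."""
    return {
        "size": size,
        "aggs": {"by_id": {"terms": {"field": field}}},
        "sort": [{field: {"order": "asc"}}],
    }


def string_order(ids):
    """Sort identifiers as strings, to show how this differs from numeric order."""
    return sorted(str(i) for i in ids)


if __name__ == "__main__":
    print(json.dumps(aggregate_and_sort("order_id"), indent=2))
    print(string_order([9, 10, 100]))
```

If numeric ordering of identifiers matters for your use case, that is one argument for keeping a numeric representation of the field as well.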

I'm wondering why you would want to store

system · June 20, 2022, 2:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
