Are ENUM stored efficiently?

champtar · October 19, 2016, 9:15am

Hi all,

I'm playing with Elasticsearch 5.0.0-rc1 to index some HTTP access logs, and I'm wondering if "ENUM" fields (finite/low cardinality) are stored efficiently (and how)?

For example, I'm indexing the HTTP request verb (GET, POST, PUT, ...) with type keyword, and I would like to know:
- How many times is a unique string written to disk (once per line/segment/shard/index)?
- Is there any specific setting to help Elasticsearch?
- Is it worth manually mapping it to a short (index/search performance, index size)?

Thanks in advance
Etienne

danielmitterdorfer · October 19, 2016, 9:40am

Hi @champtar,

They are stored as doc values internally, and the best in-depth description I know of is in the Definitive Guide.

To quote specifically the part about strings (that you care about):

Strings are encoded [...] with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values.

So strings are already stored as efficiently as possible and there is no need for you to worry about encoding them differently.
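
To make the ordinal table idea concrete, here is a minimal sketch in Python. It only illustrates the de-duplicate, sort, and assign-an-ID steps from the quote; it is not Lucene's actual on-disk format, and the function name `build_ordinals` is made up for this example.

```python
def build_ordinals(values):
    """Return (sorted table of unique terms, one ordinal per document)."""
    # De-duplicate and sort the terms: each unique string is kept once.
    table = sorted(set(values))
    # Assign each term its position in the table as its ID.
    ordinal_of = {term: i for i, term in enumerate(table)}
    # Each document then only carries a small integer.
    return table, [ordinal_of[v] for v in values]


verbs = ["GET", "POST", "GET", "PUT", "GET", "POST"]
table, ordinals = build_ordinals(verbs)
print(table)     # ['GET', 'POST', 'PUT']
print(ordinals)  # [0, 1, 0, 2, 0, 1]
```

With only a handful of distinct verbs, the per-document ordinals need very few bits each, which is why a keyword field with low cardinality is already compact in doc values.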

Daniel

champtar · October 19, 2016, 9:58am

Thanks @danielmitterdorfer

nik9000 · October 19, 2016, 12:01pm

So there are three common ways things can end up on disk:

- Inverted index
- Stored fields
- Doc values

There are others but they are rarer. Doc values work as @danielmitterdorfer explained with the ordinals. Stored fields are combined in chunks with other documents and compressed. Usually you don't interact with stored fields directly, but you interact with _source, which is stored. The inverted index also has one copy of the text per segment.

So in an index with a billion docs you'd see a small space saving by manually switching to numeric codes, but not a huge one. You'd save a tiny bit of space per chunk of stored fields and that'd add up, but probably not enough to be worth it. You'd save an even smaller amount of space in the inverted index as well, but again, I don't expect it'd add up to be worth it.

You should make sure that those strings are indexed with "index": "not_analyzed" if you are on 2.x, or "type": "keyword" if you are trying out 5.0. That gives you the most useful behavior for strings you don't want to analyze.
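
As a rough illustration, the two mappings could look like the following, written here as Python dicts that serialize to the JSON body of an index-creation request. The type name `access_log` and the field name `verb` are placeholders invented for this sketch, not something from the thread.

```python
import json


def verb_field_mapping(major_version):
    """Mapping for a non-analyzed string field, by Elasticsearch major version."""
    if major_version >= 5:
        # 5.0 introduced the dedicated keyword type.
        return {"type": "keyword"}
    # On 2.x the field is a string that is indexed but not analyzed.
    return {"type": "string", "index": "not_analyzed"}


def index_body(major_version):
    """Index-creation body; 'access_log' and 'verb' are hypothetical names."""
    return {
        "mappings": {
            "access_log": {
                "properties": {"verb": verb_field_mapping(major_version)}
            }
        }
    }


print(json.dumps(index_body(5), indent=2))
print(json.dumps(index_body(2), indent=2))
```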

One case where it is more obviously worth converting strings like GET/PUT/POST into ordinals manually is when you want to use a range query or sort them non-alphabetically. There just aren't that many HTTP verbs, so it probably isn't worth it for a field like that, but it might make more sense on a field with 500 values.
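
If you did go the manual route, a sketch of it might look like this. The code table, its ordering, and the `verb_code` field name are all assumptions made up for the example; the point is only that a hand-picked numeric order makes non-alphabetical sorting and range queries possible.

```python
# Hypothetical hand-assigned codes: read-only verbs first, then writes.
VERB_CODES = {"GET": 0, "HEAD": 1, "POST": 2, "PUT": 3, "DELETE": 4}


def encode_verb(verb):
    """Translate a verb to its numeric code before indexing."""
    return VERB_CODES[verb.upper()]


# Sorting by code follows the custom order, not the alphabet.
verbs = ["PUT", "GET", "DELETE", "POST"]
print(sorted(verbs, key=encode_verb))  # ['GET', 'POST', 'PUT', 'DELETE']

# A range query over the numeric field then selects all "write" verbs
# (codes 2 to 4) in one clause; 'verb_code' is a placeholder field name.
write_verbs_query = {
    "query": {
        "range": {
            "verb_code": {"gte": encode_verb("POST"), "lte": encode_verb("DELETE")}
        }
    }
}
print(write_verbs_query)
```

The cost is that the application has to own the code table and translate in both directions, which is part of why it is rarely worth it for a handful of values.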
