I'm playing with Elasticsearch 5 RC1 to index some HTTP access logs, and I'm wondering whether "ENUM" fields (finite, low-cardinality fields) are stored efficiently, and if so, how.
For example, I'm indexing the HTTP request verb (GET, POST, PUT, ...) with type keyword, and I would like to know:
- How many times is a unique string written to disk (once per line/segment/shard/index)?
- Is there any specific setting to help Elasticsearch?
- Is it worth manually mapping it to a short (index/search performance, index size)?
They are stored as doc values internally, and the best in-depth description I know of is in the Definitive Guide.
To quote specifically the part about strings (that you care about):
Strings are encoded [...] with the help of an ordinal table. The strings are de-duplicated and sorted into a table, assigned an ID, and then those ID’s are used as numeric doc values.
So strings are already stored as efficiently as possible, and there is no need for you to worry about encoding them differently.
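To make the quoted description concrete, here is a minimal Python sketch of the ordinal-table idea: de-duplicate the strings, sort them, and store one small integer per document. It is only an illustration of the concept; the function name build_ordinals is made up for this example, and Lucene's actual on-disk encoding is more involved.

```python
def build_ordinals(values):
    """Return (table, ordinals) for a list of per-document string values.

    table    -- the unique strings, de-duplicated and sorted
    ordinals -- one integer per document, indexing into table
    NOTE: illustrative only; this is not Lucene's real implementation.
    """
    table = sorted(set(values))
    term_to_id = {term: i for i, term in enumerate(table)}
    ordinals = [term_to_id[v] for v in values]
    return table, ordinals


verbs = ["GET", "POST", "GET", "PUT", "GET", "POST"]
table, ordinals = build_ordinals(verbs)
print(table)     # ['GET', 'POST', 'PUT']
print(ordinals)  # [0, 1, 0, 2, 0, 1]
```

Each distinct verb appears once in the table, and every document only carries a small number, which is why re-encoding the verbs yourself buys little for doc values.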
So there are three common ways things can end up on disk:
- Inverted index
- Stored fields
- Doc values
There are others, but they are rarer. Doc values work with ordinals, as @danielmitterdorfer explained. Stored fields are combined in chunks with other documents and compressed. Usually you don't interact with stored fields directly, but you do interact with _source, which is itself stored. The inverted index also has one copy of the text per segment.
So in an index with a billion docs you'd see a small space saving by manually switching to a numeric code, but not a huge one. You'd save a tiny bit of space per chunk of stored fields and that would add up, but probably not enough to be worth it. You'd save an even smaller amount of space in the inverted index as well, but again, I don't expect the savings to add up to be worth it.
You should make sure that those strings are indexed as "index": "not_analyzed" if you are on 2.x, or "type": "keyword" if you are trying out 5.0. That gives you the most useful behavior for strings you don't want to analyze.
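As a rough sketch of what those two settings look like in an index-creation body, the Python snippet below builds the mapping for either version. The type name access_log, the field name verb, and the helper verb_mapping are placeholders I made up for illustration; adapt them to your own index.

```python
import json


def verb_mapping(major_version, doc_type="access_log", field="verb"):
    """Build an index-creation body that keeps the field un-analyzed.

    doc_type and field are hypothetical names used only for this sketch.
    """
    if major_version >= 5:
        # 5.0: the keyword type replaces not_analyzed strings
        field_def = {"type": "keyword"}
    else:
        # 2.x: a string field that is indexed but not analyzed
        field_def = {"type": "string", "index": "not_analyzed"}
    return {"mappings": {doc_type: {"properties": {field: field_def}}}}


# Print the body you would send when creating the index on 5.0
print(json.dumps(verb_mapping(5), indent=2))
```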
One case where it is more obviously worth converting strings like GET/PUT/POST into ordinals manually is when you want to use a range query on them or sort them non-alphabetically. There just aren't that many HTTP verbs, so it probably isn't worth it for a field like that, but it might make more sense on a field with 500 values.
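If you did want to go down that road, a minimal sketch might look like the following. The ordering in VERB_ORDER, the extra field name verb_code, and the helper encode are all assumptions made for this example, not anything Elasticsearch provides; the idea is simply to index a numeric code next to the keyword so that sorting and range queries follow your own order.

```python
# Hypothetical custom order: reads first, then writes (not alphabetical)
VERB_ORDER = ["GET", "HEAD", "POST", "PUT", "DELETE"]
VERB_TO_CODE = {verb: code for code, verb in enumerate(VERB_ORDER)}


def encode(doc):
    """Return a copy of doc with a numeric verb_code added before indexing."""
    out = dict(doc)
    out["verb_code"] = VERB_TO_CODE[doc["verb"]]
    return out


docs = [{"verb": "PUT"}, {"verb": "GET"}, {"verb": "DELETE"}, {"verb": "POST"}]
encoded = [encode(d) for d in docs]

# Sorting on the code follows the custom order, not the alphabet
by_custom_order = [d["verb"] for d in sorted(encoded, key=lambda d: d["verb_code"])]
print(by_custom_order)  # ['GET', 'POST', 'PUT', 'DELETE']

# A range query body selecting "POST or later" in the custom order
write_query = {"query": {"range": {"verb_code": {"gte": VERB_TO_CODE["POST"]}}}}
print(write_query)
```

The cost is that you have to maintain the verb-to-code table yourself and translate codes back when displaying results, which is why it rarely pays off for a handful of values.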