I'm doing some overall testing on my cluster, debating whether I should switch
to Doc Values. I have about 15 fields per document, with 83 million
documents spread across 60 indices. All the fields are dynamically mapped,
and all of them can migrate to Doc Values. So I have one copy of the data
using the fielddata cache (FDC) and a second copy using Doc Values (DV).
Overall it's a 3x increase in consumed disk space, and a 98% decrease in
FDC size when using DV.
My question is: what is that last leftover 2%? If everything is on disk,
why is it reporting memory usage in the FDC? Some indices report 0 bytes,
but others report anywhere from 34 KB to 700 KB. What am I missing here?
Are things still loaded into the FDC anyway? Maybe I missed a field type
in the dynamic templates?
Also, what field types are recommended to move to Doc Values?
High-cardinality non-analyzed string fields? High-cardinality in general?
Everything?
Overall performance seems similar, but this is just one of quite a few
data sets that would be interacted with at any given time, and hopefully
I'll have fewer memory issues (GC pressure/evictions).
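For reference, the per-index numbers above come from the fielddata stats; something like the following breaks the remaining usage down per field (the index name is illustrative):

    # per-field fielddata memory for one index
    curl -s 'localhost:9200/logs-000/_stats/fielddata?fields=*&pretty'

    # or per node, per field, via the cat API
    curl -s 'localhost:9200/_cat/fielddata?v&fields=*'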
All fields that you search on and aggregate on should be moved to doc
values, in my opinion. By the way, Elasticsearch 2.0 will enable doc values
by default except on analyzed string fields.
We still need some fielddata memory for something called the "global
ordinal map". When you have string fields, we typically do computations
using the ordinals of these strings instead of their actual values. However,
these ordinals only exist per segment, and sometimes we need them to be
consistent across an entire shard, so we build this global ordinal map
that stores the mapping and keep it around in memory.
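To make that concrete, here is a toy example (the terms are made up). Each segment numbers only the terms it contains, so the same term can get different ordinals in different segments:

    segment 1 terms: [apple, banana]   -> per-segment ordinals: apple=0, banana=1
    segment 2 terms: [banana, cherry]  -> per-segment ordinals: banana=0, cherry=1

    global ordinals: apple=0, banana=1, cherry=2
    map kept in fielddata: seg1 {0->0, 1->1}, seg2 {0->1, 1->2}

That small per-shard mapping is what shows up as leftover fielddata memory, even when the values themselves live on disk in doc values.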
Thanks for the clarification, Adrien. If that's the case, is there a
flag that can enable them by default for all fields (excluding analyzed
strings; using ~1.4.3 here)?
Also, do you guys have more performance metrics on using Doc Values vs.
FDC? I've seen the "10-25% slower" figure thrown around, but I wanted to
know what that was tested with (CPU, memory, spinning disk vs. SSD, etc...)
and where gains may be had.
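For context, the closest thing I've found to such a flag is setting doc_values through the dynamic templates of an index template, roughly like this (template and rule names are illustrative, 1.x syntax):

    curl -XPUT 'localhost:9200/_template/dv_defaults' -d '{
      "template": "*",
      "mappings": {
        "_default_": {
          "dynamic_templates": [
            { "strings_as_dv": {
                "match_mapping_type": "string",
                "mapping": { "type": "string", "index": "not_analyzed", "doc_values": true }
            }},
            { "longs_as_dv": {
                "match_mapping_type": "long",
                "mapping": { "type": "long", "doc_values": true }
            }}
          ]
        }
      }
    }'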
In my debugging, the current differences are usually the cost of a
predictable branch (bounds check) coming from ByteBuffer.get(). Fielddata
uses simple Java arrays, and today the Java compiler can more easily
optimize away those checks in that case.
But IMO benchmarking here is usually not done correctly: it doesn't
consider the impact of having such huge, badly-compressed data in heap
memory, e.g. impacts on GC and other problems people have. So I recommend
doing a test with real data and real workloads.
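For example, run the same real aggregation against the FDC-backed and DV-backed copies of the data and compare the reported took times along with heap/GC behavior (index and field names are illustrative):

    # same query against both copies; compare "took" and GC activity
    curl -s 'localhost:9200/logs_fdc/_search?search_type=count&pretty' -d '{
      "aggs": { "top": { "terms": { "field": "status" } } }
    }'
    curl -s 'localhost:9200/logs_dv/_search?search_type=count&pretty' -d '{
      "aggs": { "top": { "terms": { "field": "status" } } }
    }'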
@rmuir Interesting, it sounds like my gains may be better than previously
expected, given that the server is constantly evicting fielddata from the
heap. If I'm able, I'll post some performance metrics back here when I'm done.