With the release of Elasticsearch 1.4, I discovered doc_values.
However, their use remains a bit obscure to me, and the documentation
didn't help. As I understand it, they are mostly useful when performing
aggregations, as they reduce the amount of data loaded into memory.
The documentation recommends enabling doc_values on not_analyzed strings, as
well as on values that are used for aggregations. But for instance, if my
aggregations sum one value called "amount", does that make the
"amount" integer/double a good candidate for doc_values, or are they only
useful for properties that consume a lot of space?
Overall, if my use case is almost exclusively aggregations, should I
enable doc_values on all properties except the analyzed strings?
I'm using a machine with a fast SSD, so it should be OK.
I'm not sure I understand the downside "you cannot filter". What do you
mean exactly? I tried a range filter on the value, and it does
work!
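A minimal sketch of the kind of request described above, assuming Elasticsearch 1.x query syntax (the `filtered` query): a range filter on "amount" combined with a sum aggregation on the same field. The threshold 10, the aggregation name `total_amount`, and the helper `build_sum_query` are hypothetical, chosen only for illustration. The script only builds and prints the request body; it does not contact a cluster.

```python
import json


def build_sum_query(field, minimum):
    """Build a 1.x-style search body: range filter plus sum aggregation on one field."""
    return {
        # size 0: only the aggregation result is wanted, not the hits
        "size": 0,
        "query": {
            "filtered": {
                "filter": {"range": {field: {"gte": minimum}}}
            }
        },
        # Hypothetical aggregation name derived from the field name
        "aggs": {"total_" + field: {"sum": {"field": field}}},
    }


body = build_sum_query("amount", 10)
print(json.dumps(body, indent=2))
```

The same body can be sent unchanged whether or not "amount" is mapped with doc_values, which makes it usable for comparing the two setups.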
On Tuesday, 7 October 2014 16:23:41 UTC+2, Ivan Brusic wrote:
Perhaps it is easier to talk about the downsides of doc_values.
If you have slow disks, common when using low level VMs with shared disks,
then retrieving your data will be much slower.
Also, you cannot filter on doc_values fields, so it depends on your other
use cases.
The amount field seems like a good candidate for doc_values, but it
depends on the downsides I highlighted above.
Cheers,
Ivan
On Oct 7, 2014 6:11 AM, "Michaël Gallego" <mic...@maestrooo.com> wrote:
[quoted question snipped]
Doc values just mean that field data is computed at indexing time and
stored on disk, as opposed to computed at search time (which is the
default). You can expect them to be slightly slower, but they have
significant benefits too: their memory footprint is low, since most of the
data stays on disk. In particular, this helps with other issues such as
garbage collection. And since they are computed at indexing time, you don't
have the cold start issue that you can have with in-memory fielddata,
where rebuilding it from the inverted index is both CPU and I/O
intensive. We initially added doc values support in 1.0, but they have gotten
better with each release (especially in the forthcoming 1.4 release),
and we are even thinking about making them the default in a future release
(nothing decided yet, just thinking about it).
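As a rough illustration of the explanation above, here is a sketch of a mapping that enables doc values, assuming the 1.x `"doc_values": true` mapping option. Only the "amount" field comes from this thread; the type name `payment`, the fields `category` and `description`, and the helper `doc_values_mapping` are invented for the example. The script only builds and prints the mapping body.

```python
import json


def doc_values_mapping(type_name, numeric_fields, keyword_fields, analyzed_fields):
    """Build a 1.x-style mapping with doc values on numeric and not_analyzed string fields."""
    properties = {}
    for name, es_type in numeric_fields.items():
        properties[name] = {"type": es_type, "doc_values": True}
    for name in keyword_fields:
        properties[name] = {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": True,
        }
    for name in analyzed_fields:
        # Analyzed strings are left on the default (in-memory) fielddata
        properties[name] = {"type": "string"}
    return {type_name: {"properties": properties}}


mapping = doc_values_mapping(
    "payment",                    # hypothetical type name
    {"amount": "double"},         # field from the question
    ["category"],                 # hypothetical not_analyzed string
    ["description"],              # hypothetical analyzed string
)
print(json.dumps(mapping, indent=2))
```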
I've created two types and filled them with data that mimics my own
data: one with doc_values on all not_analyzed string and numeric fields,
and one without doc_values. My additional question: is there any
overhead in mixing doc_values and non-doc_values fields inside the same
document, or does it not matter?
On my small machine and Elasticsearch 1.3, doc_values are approximately 25%
slower. If 1.4 makes this much better, I think I'll use them. However, the
question is: how can I actually benchmark this? I didn't find any way to
profile my aggregation queries to see things like the memory consumed
by one query. Without this information, how can I get the
metrics that would help me choose between using doc_values and not using
them?
It is perfectly fine to have some fields that have doc values and other
fields that don't.
Doc values are indeed faster in 1.4, especially on numeric fields. Using
doc values doesn't change the memory usage of the aggregations themselves.
However, it does change the memory usage of field data. It's a bit tricky to
track today because doc values are not accounted for under "fielddata" but
under the "segments"[1] memory usage.
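Since doc values are reported under "segments" rather than "fielddata", one way to compare the two setups is to read both numbers from the indices stats API (for example `GET /_stats/fielddata,segments`) and put them side by side. The sketch below assumes the response carries `fielddata.memory_size_in_bytes` and `segments.memory_in_bytes` under each index's `total` section, and that the two variants live in separate indices (stats are reported per index, not per type). The index names and the byte counts in `sample` are made-up placeholders, not measurements.

```python
def memory_by_index(stats):
    """Extract fielddata and segments memory (in bytes) per index from a stats response."""
    result = {}
    for name, entry in stats.get("indices", {}).items():
        total = entry.get("total", {})
        result[name] = {
            "fielddata_bytes": total.get("fielddata", {}).get("memory_size_in_bytes", 0),
            "segments_bytes": total.get("segments", {}).get("memory_in_bytes", 0),
        }
    return result


# Placeholder response fragment; index names and numbers are invented.
sample = {
    "indices": {
        "with_doc_values": {
            "total": {
                "fielddata": {"memory_size_in_bytes": 0},
                "segments": {"memory_in_bytes": 2000},
            }
        },
        "without_doc_values": {
            "total": {
                "fielddata": {"memory_size_in_bytes": 5000},
                "segments": {"memory_in_bytes": 1000},
            }
        },
    }
}

summary = memory_by_index(sample)
for index_name in sorted(summary):
    print(index_name, summary[index_name])
```

Running the same aggregation against both indices and reading the stats afterwards gives a per-index view of memory; it does not isolate the cost of a single query.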