Understanding doc_values?

Hi,

With the release of Elasticsearch 1.4, I discovered doc_values.
However, their use remains a bit obscure to me, and the documentation
didn't help. As I understand it, they are mostly useful when performing
aggregations, as they reduce the amount of data loaded in memory. The docs
recommend enabling doc_values on not_analyzed strings, as well as on
values that are used in aggregations. But for instance, if my
aggregations are about summing one value called "amount", does that make the
"amount" integer/double a good candidate for doc_values, or are they only
useful for properties that consume a lot of space?

Overall, if my use case is almost entirely aggregations, should I enable
doc_values on all properties except the analyzed strings?

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0d204979-3403-4b07-9782-c4b52120f7e9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Perhaps it is easier to talk about the downsides of doc_values.

If you have slow disks, which is common on low-end VMs with shared disks,
then retrieving your data will be much slower.

Also, you cannot filter on doc_values fields, so it depends on your other
use cases.

The amount field seems like a good candidate for doc_values, but it depends
on the downsides I highlighted above.

Cheers,

Ivan

I'm using a machine with a fast SSD, so it should be OK.

I'm not sure I understand the downside "you cannot filter". What do you
mean exactly? I tried a range filter on the value, and it does work!
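
For reference, a request of the kind described above might look like the following sketch. It assumes the Elasticsearch 1.x query DSL (a `filtered` query wrapping a `range` filter) and a `sum` aggregation; the threshold of 10 and the aggregation name `total_amount` are made up for illustration, and only the request body is built here, nothing is sent to a cluster.

```python
import json


def build_sum_request(field="amount", minimum=10):
    """Build a search body that range-filters on `field` and sums it.

    Assumes ES 1.x syntax; `minimum` and the aggregation name are
    illustrative placeholders, not values from this thread.
    """
    return {
        "size": 0,  # only the aggregation result is wanted, not the hits
        "query": {
            "filtered": {
                "filter": {"range": {field: {"gte": minimum}}}
            }
        },
        "aggs": {"total_amount": {"sum": {"field": field}}},
    }


if __name__ == "__main__":
    print(json.dumps(build_sum_request(), indent=2))
```

The same field is used by both the filter and the aggregation, which is the combination being tested in this message.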

Hi Michaël,

Doc values just mean that field data will be computed at indexing time and
stored on disk as opposed to computed at search time (which is the
default). You can expect them to be slightly slower, but they have
significant benefits too: their memory footprint is low, since most stuff
is stored on disk. In particular this will help with other issues such as
garbage collection. And since they are computed at indexing time, you don't
have the cold-start issue that you can have with in-memory fielddata, which
has to be rebuilt from the inverted index, an operation that is both CPU and
I/O intensive. We initially added doc values support in 1.0 but they got better
and better with each release (especially in the forthcoming 1.4 release)
and we are even thinking about making them the default in a future release
(nothing decided yet, just thinking about it).
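
To make this concrete, here is a minimal sketch of a mapping that turns doc values on for the fields used in aggregations. It assumes the `doc_values: true` mapping option of the 1.x series; the type name `order` and the `status` and `description` fields are hypothetical, only `amount` comes from this thread. The code only builds the mapping body and does not talk to a cluster.

```python
import json


def build_mapping(type_name="order"):
    """Return a 1.x-style mapping body with doc_values on aggregation fields.

    `type_name`, `status` and `description` are hypothetical names.
    """
    return {
        type_name: {
            "properties": {
                # numeric field that gets summed: stored as doc values
                "amount": {"type": "double", "doc_values": True},
                # not_analyzed string: also eligible for doc values
                "status": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,
                },
                # analyzed string: left on the default in-memory fielddata
                "description": {"type": "string"},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
```

Since the structures are built at indexing time, such a mapping would need to be in place before the documents are indexed.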

--
Adrien Grand

Hi again,

I've created two types and filled them with data that mimics my own: one
with doc_values on all non-analyzed string and numeric fields, and one
without doc_values. My additional question is whether there is any overhead
in mixing doc_values and non-doc_values fields inside the same document, or
does it not matter?

On my small machine with Elasticsearch 1.3, doc_values is approximately 25%
slower. If 1.4 makes this much better, I think I'll use them. However, the
question is: how can I actually benchmark this? I didn't find any way to
profile my aggregation queries to see things like the memory consumed
by one query. Without this information, how can I get the metrics that
would help me choose between using doc_values and not using them?

Thanks a lot!

It is perfectly fine to have some fields that have doc values and other
fields that don't.

Doc values are indeed faster in 1.4, especially on numeric fields. Using
doc values doesn't change the memory usage of aggregations. However, it
will change the memory usage for field data. It's a bit tricky to track
today because doc values are not counted under "fielddata" but under the
"segments"[1] memory usage.

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-segments.html
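
As a rough way to compare the two numbers, the sketch below pulls both figures out of an indices stats response that has already been parsed into a dict. The key names (`_all.total.fielddata.memory_size_in_bytes` and `_all.total.segments.memory_in_bytes`) are my assumption about the shape of the 1.x stats output and should be checked against a real response; the sample values are invented purely to exercise the function.

```python
def memory_summary(stats):
    """Extract fielddata and segments memory (in bytes) from a stats dict.

    Assumes the layout stats["_all"]["total"][...]; verify the key names
    against the output of your own cluster.
    """
    total = stats["_all"]["total"]
    return {
        "fielddata_bytes": total["fielddata"]["memory_size_in_bytes"],
        "segments_bytes": total["segments"]["memory_in_bytes"],
    }


# Invented sample shaped like a stats response; not real measurements.
SAMPLE_STATS = {
    "_all": {
        "total": {
            "fielddata": {"memory_size_in_bytes": 0, "evictions": 0},
            "segments": {"count": 12, "memory_in_bytes": 2048},
        }
    }
}

if __name__ == "__main__":
    print(memory_summary(SAMPLE_STATS))
```

Running the same aggregations against both types and comparing these two figures before and after would show where the memory moved.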

--
Adrien Grand
