Doc values for field data


(David Smith-2) #1

When you map fields to use doc values for field data, does that limit the functionality afforded to those fields to merely sorting and aggregations/faceting?

The documentation mentions that filtering is not supported by numeric or string types when stored as doc values. Yikes, I thought that doc values are intended for working with field data when it's too large to load into memory. Is that not the case?

I read both of the following pages but I'm not sure I quite understand where the usefulness of field data fields kicks in.

http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html

Can someone please clarify?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/FC9E6ECA-B869-4B40-B2C8-F55CE6AB6790%40gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

Hi David,

Doc values are a way to compute field data at indexing time and to store
it on disk. They can do everything that "uninverted" field data can do:
aggregations, sorting, etc. However, they never kick in automatically: they
need to be configured explicitly, and can only be set at index creation
time; you cannot enable them afterwards.
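
As a rough illustration of what "configured explicitly at index creation time" can look like, here is a minimal sketch that builds a create-index request body in Python. It assumes the 1.x-era mapping syntax in which doc values are requested per field through the `fielddata` format setting. The index, type, and field names (`logs`, `event`, `user_id`, `status`) are made up for the example and do not come from this thread.

```python
import json

# Assumption: Elasticsearch 1.x mapping syntax, where doc values are
# requested per field through the fielddata format.
def doc_values_field(field_type, index_mode=None):
    """Return a field mapping that keeps field data on disk as doc values."""
    mapping = {"type": field_type, "fielddata": {"format": "doc_values"}}
    if index_mode is not None:
        mapping["index"] = index_mode
    return mapping

# Hypothetical type and field names, used only for illustration.
create_index_body = {
    "mappings": {
        "event": {
            "properties": {
                "user_id": doc_values_field("long"),
                "status": doc_values_field("string", index_mode="not_analyzed"),
            }
        }
    }
}

# This body would be sent when the index is created (e.g. PUT /logs),
# since the setting cannot be switched on for an existing field later.
print(json.dumps(create_index_body, indent=2))
```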

Fielddata filtering is a way to trade accuracy for memory by loading only
"important" terms into memory. It doesn't work with doc values because it
would not be useful there: doc values are stored on disk anyway (and thus
don't require much memory).
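
To make the accuracy-for-memory trade-off concrete, here is a small sketch. The first part is a mapping fragment that assumes the 1.x-era `fielddata.filter.frequency` setting; the field name and thresholds are invented. The second part is a plain-Python imitation of the idea (not Elasticsearch's implementation) showing how terms outside a frequency band would simply never be loaded.

```python
# Assumption: Elasticsearch 1.x syntax for in-memory fielddata filtering.
# "tag" and the thresholds below are hypothetical.
tag_mapping = {
    "tag": {
        "type": "string",
        "fielddata": {
            "filter": {"frequency": {"min": 0.01, "max": 0.5}}
        },
    }
}

def filter_terms_by_frequency(doc_counts, num_docs, min_freq, max_freq):
    """Keep only terms whose document frequency lies in [min_freq, max_freq]."""
    kept = {}
    for term, count in doc_counts.items():
        freq = count / num_docs
        if min_freq <= freq <= max_freq:
            kept[term] = count
    return kept

# Toy statistics: number of documents (out of 1000) containing each term.
doc_counts = {"the": 950, "elasticsearch": 40, "xyzzy": 1}
loaded = filter_terms_by_frequency(doc_counts, num_docs=1000,
                                   min_freq=0.01, max_freq=0.5)
print(sorted(loaded))  # only the terms that would be held in memory
```

Terms that are filtered out are invisible to sorting and aggregations on that field, which is the accuracy being given up.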

Does that clarify things?

On Mon, Jul 14, 2014 at 7:26 PM, David K Smith davidksmith2k@gmail.com
wrote:

When you map fields to use doc values for field data, does that limit the
functionality afforded to those fields to merely sorting and
aggregations/faceting?

The documentation mentions that filtering is not supported by numeric or
string types when stored as doc values. Yikes, I thought that doc values are
intended for working with field data when it's too large to load into
memory. Is that not the case?

I read both of the following pages but I'm not sure I quite understand
where the usefulness of field data fields kicks in.

http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html

Can someone please clarify?

--
Adrien Grand


(David Smith-2) #3

Thanks, Adrien. That brings me closer.

So when the documentation says doc values do not support filtering, it's
talking about fielddata filtering for what's loaded into memory (and not
filtering as part of a query, say a term filter). For further clarification:

  • can a field that is not analyzed and only kept as doc values be used for
    querying/filtering (say a term filter on a numeric field or match query on
    a string field)? Or does all querying/filtering require the field to be in
    the uninverted index?

What I'm trying to understand is how we can optimize querying/filtering in a
large index (5 billion documents / 1 TB). It's very hard to run a simple
term filter because a bitset filter will need to be calculated that
includes every single document. Wouldn't that utilize a lot of memory? Is
there a way to speed that up?


(Adrien Grand) #4

On Tue, Jul 15, 2014 at 3:25 PM, David Smith davidksmith2k@gmail.com
wrote:

Thanks, Adrien. That brings me closer.

So when the documentation says doc values do not support filtering, it's
talking about fielddata filtering for what's loaded into memory (and not
filtering as part of a query, say a term filter).

Exactly.

For further clarification - can a field that is not analyzed and only kept
as doc values be used for querying/filtering (say a term filter on a
numeric field or match query on a string field)? Or does all
querying/filtering require the field to be in the uninverted index?

Doc values play no role when filtering (except for some filters that
support a fielddata mode, such as the range filter[1]). So if your field
has index: no, you cannot use it in filters, and if it has
index: not_analyzed, then you can, whether or not doc values are enabled.

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-range-filter.html#_execution
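
A minimal sketch of the rule above, written as Python dictionaries. It assumes 1.x-era mapping and filter syntax; the field names (`status`, `raw_payload`, `age`) are hypothetical, and the placement of the `execution` key in the range filter is my reading of the page referenced in [1], so treat it as an assumption rather than a verified request.

```python
def can_filter(field_mapping):
    """Per the rule above: a field is filterable unless it has index: no.
    Whether doc values are enabled does not enter into it."""
    return field_mapping.get("index") != "no"

# Hypothetical fields, assuming Elasticsearch 1.x mapping syntax.
properties = {
    "status": {"type": "string", "index": "not_analyzed",
               "fielddata": {"format": "doc_values"}},
    "raw_payload": {"type": "string", "index": "no"},
}

# A term filter goes through the inverted index, so it needs an indexed field.
term_filter = {"term": {"status": "active"}}

# A range filter can optionally be executed against field data instead
# (assumed key placement: "execution" as a sibling of the field name).
range_filter = {"range": {"age": {"gte": 10, "lte": 20},
                          "execution": "fielddata"}}

for name, mapping in properties.items():
    print(name, "filterable:", can_filter(mapping))
```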

What I'm trying to understand is how we can optimize querying/filtering in a
large index (5 billion documents / 1 TB). It's very hard to run a simple
term filter because a bitset filter will need to be calculated that
includes every single document. Wouldn't that utilize a lot of memory? Is
there a way to speed that up?

If your filters are unlikely to be reused, then you should disable caching
for them by setting _cache to false. Caching filters only makes filtering
faster when the likelihood of reusing them is high.
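
For a sense of scale, and as a sketch of what setting `_cache` to false can look like, the snippet below does two things. It estimates the size of one cached filter under the simplifying assumption of an uncompressed bitset with one bit per document (the real cache may store results differently, and the total is spread across shards). It then builds a 1.x-style `filtered` query whose term filter opts out of caching. The field name `user_id` and the value are placeholders.

```python
import json

def bitset_bytes(num_docs):
    """Bytes for an uncompressed bitset holding one bit per document."""
    return (num_docs + 7) // 8

def uncached_term_filter(field, value):
    """Build a filtered query whose term filter is not cached.
    Assumption: Elasticsearch 1.x filter syntax with the _cache flag."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {field: value, "_cache": False}},
            }
        }
    }

size = bitset_bytes(5_000_000_000)
print(size, "bytes per cached filter, about", round(size / 2**20), "MiB")
print(json.dumps(uncached_term_filter("user_id", 42)))
```

Under that assumption, every distinct cached filter over 5 billion documents costs on the order of 600 MB, which is why caching only pays off for filters that are reused often.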

--
Adrien Grand


(David Smith-2) #5

Thank you, Adrien. That answers my questions.

