When you map fields to use doc values for field data, does that limit the functionality afforded to those fields to merely sorting and aggregations/faceting?
The documentation mentions that filtering is not supported by numeric or string types when stored as doc values. Yikes, I thought that doc values were intended for working with field data when it's too large to load into memory. Is that not the case?
I read both of the following pages, but I'm not sure I quite understand where the usefulness of field data fields kicks in.
Doc values are a way to compute field data at indexing time, and to store
it on disk. They can do everything that "uninverted" field data can do:
aggregations, sorting, etc. However, they never kick in automatically: they
need to be configured explicitly, and can only be set at index creation
time; you cannot enable them afterwards.
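As a rough illustration of "configured explicitly at index creation time", here is a minimal sketch of an index-creation body that asks for doc values on one field. It assumes the 1.x-era mapping syntax (fielddata format set to doc_values); the type name "event" and the field name "status" are made up for the example, so adapt them and check the syntax against the documentation for your version.

```python
import json


def doc_values_mapping(type_name, field_name):
    """Build an index-creation body that stores one string field's
    field data as doc values (assumed 1.x-era syntax)."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    field_name: {
                        "type": "string",
                        # Still indexed, so it stays usable in filters.
                        "index": "not_analyzed",
                        # Field data is written to disk at indexing time.
                        "fielddata": {"format": "doc_values"},
                    }
                }
            }
        }
    }


# "event" and "status" are hypothetical names used only for illustration.
body = doc_values_mapping("event", "status")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the index-creation request, since the setting cannot be turned on later.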
Fielddata filtering is a way to trade accuracy for memory by only loading
"important" terms into memory. It doesn't work with doc values because it
would not be useful there: doc values are stored on disk anyway (and thus
don't require much memory).
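For comparison, this is roughly what fielddata filtering looks like in a mapping: a frequency filter that only loads terms whose document frequency falls within a range. It is a sketch assuming the 1.x-era syntax; the thresholds are arbitrary example values, not recommendations.

```python
import json

# In-memory ("uninverted") field data with a frequency filter: only terms
# whose per-segment document frequency falls between min and max are loaded.
# All numbers below are arbitrary illustrations.
filtered_fielddata_field = {
    "type": "string",
    "fielddata": {
        "filter": {
            "frequency": {
                "min": 0.001,
                "max": 0.1,
                "min_segment_size": 500,
            }
        }
    },
}

print(json.dumps(filtered_fielddata_field, indent=2))
```

Note that this is a setting about which terms get loaded into memory; it is unrelated to filters used in a search request.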
So when the documentation says doc values do not support filtering, it's
talking about fielddata filtering for what's loaded into memory (and not
filtering as part of a query... say term filter). For further clarification,
can a field that is not analyzed and only kept as doc values be used for
querying/filtering (say a term filter on a numeric field or match query on
a string field)? Or does all querying/filtering require the field to be in
the uninverted index?
What I'm trying to understand is how we can optimize querying/filtering in a
large index (5 billion documents / 1 TB). It's very hard to run a simple
term filter because a bitset filter will need to be calculated that
includes every single document. Wouldn't that utilize a lot of memory? Is
there a way to speed that up?
So when the documentation says doc values do not support filtering, it's
talking about fielddata filtering for what's loaded into memory (and not
filtering as part of a query... say term filter).
Exactly.
For further clarification - can a field that is not analyzed and only kept
as doc values be used for querying/filtering (say a term filter on a
numeric field or match query on a string field)? Or does all
querying/filtering require the field to be in the uninverted index?
Doc values play no role when filtering (except for some filters that
support a fielddata mode, such as the range filter[1]). So if your field
has index: no, you cannot use it in filters; if it has
index: not_analyzed, you can, no matter whether doc values are enabled
or not.
[1]
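To make that concrete, here is a small sketch of a search body with a term filter on a not_analyzed field. It assumes the 1.x-era filtered query syntax; the field name "status" and the value "active" are hypothetical. The same request works whether or not the field also has doc values, because the filter is answered from the index.

```python
import json


def term_filter_query(field, value):
    """Wrap a term filter in a filtered query (assumed 1.x-era syntax).
    The field must be indexed (for example index: not_analyzed)."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {field: value}},
            }
        }
    }


# Hypothetical field and value, for illustration only.
search_body = term_filter_query("status", "active")
print(json.dumps(search_body, indent=2))
```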
What I'm trying to understand is how we can optimize querying/filtering in a
large index (5 billion documents / 1 TB). It's very hard to run a simple
term filter because a bitset filter will need to be calculated that
includes every single document. Wouldn't that utilize a lot of memory? Is
there a way to speed that up?
If your filters are unlikely to be reused, then you should disable caching
for them by setting _cache to false. Caching filters only makes filtering
faster when the likelihood of reusing them is high.
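A minimal sketch of both points: a term filter with caching turned off, plus a back-of-the-envelope estimate of what one cached bitset could cost. The estimate assumes a plain bitset with one bit per document summed over the whole index, which is only a rough upper bound (the real cache works per segment and may store results differently). The field name "session_id" and its value are hypothetical.

```python
import json


def uncached_term_filter(field, value):
    """Term filter that asks not to be cached (assumed 1.x-era syntax)."""
    return {"term": {field: value, "_cache": False}}


def bitset_bytes(num_docs):
    """Size of a plain bitset with one bit per document, rounded up
    to whole bytes. A rough upper bound, not a measurement."""
    return (num_docs + 7) // 8


# Hypothetical one-off filter that is unlikely to be reused.
one_off_filter = uncached_term_filter("session_id", "abc123")
print(json.dumps(one_off_filter, indent=2))

# 5 billion documents, the index size mentioned in the question.
estimate = bitset_bytes(5_000_000_000)
print("approx. %d bytes (%.0f MiB) per cached filter" % (estimate, estimate / 1024.0 ** 2))
```

Under that assumption, each cached filter on a 5 billion document index could pin several hundred megabytes in total, which is why caching filters that are never reused is wasteful.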
On Wednesday, July 16, 2014 5:24:36 AM UTC-4, Adrien Grand wrote:
On Tue, Jul 15, 2014 at 3:25 PM, David Smith <davidk...@gmail.com> wrote:
Thanks, Adrien. That brings me closer.