Filter aggregation and nested documents


(Olivier B) #1

Hi all,

I'm working with nested documents (millions of them) and I run
aggregations on them. Naturally, I need to use a filter aggregation
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-filter-aggregation.html),
but this does not seem to work with nested documents:

{
  "aggs": {
    "items": {
      "nested": {
        "path": "items"
      },
      "filter": {
        "ids": {
          "values": [
            "2AA4CE67-9469-4AE7-AC99-46F7E2646C2F"
          ]
        }
      },
      "aggs": {
        "questions": {
          "terms": {
            "field": "items.question_label.raw",
            "size": 0
          }
        }
      }
    }
  }
}

Response:
Parse Failure [Found two aggregation type definitions in [items]: [nested]
and [filter]. Only one type is allowed.]]; }]
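
A note on the parse error above: an aggregation object accepts only one type, so "nested" and "filter" cannot sit side by side under the same name. A minimal sketch of one way around it, assuming the 1.x aggregation DSL, is to make the filter the outer aggregation and put the nested aggregation inside it as a sub-aggregation. The aggregation name single_doc and the helper build_filtered_nested_agg are made up for illustration, and the body is built in Python only so it can be printed and checked. This addresses the parse failure only, not the memory usage discussed later in the thread.

```python
import json


def build_filtered_nested_agg(doc_id, path="items",
                              field="items.question_label.raw"):
    """Build a request body shaped as: filter agg -> nested agg -> terms agg."""
    return {
        "aggs": {
            # Hypothetical name; the filter is the ONLY type at this level.
            "single_doc": {
                "filter": {"ids": {"values": [doc_id]}},
                "aggs": {
                    # The nested aggregation becomes a sub-aggregation.
                    "items": {
                        "nested": {"path": path},
                        "aggs": {
                            "questions": {
                                "terms": {"field": field, "size": 0}
                            }
                        },
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_filtered_nested_agg("2AA4CE67-9469-4AE7-AC99-46F7E2646C2F")
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the search body in place of the request above.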

So I tried another way:
{
  "query": {
    "filtered": {
      "filter": {
        "ids": {
          "values": [
            "2AA4CE67-9469-4AE7-AC99-46F7E2646C2F"
          ]
        }
      }
    }
  },
  "aggs": {
    "items": {
      "nested": {
        "path": "items"
      },
      "aggs": {
        "questions": {
          "terms": {
            "field": "items.question_label.raw",
            "size": 0
          }
        }
      }
    }
  }
}

This works, but:

  • it takes several seconds,
  • the cache fills up very quickly,
  • because the cache is full, it refuses new queries (I'm using ES 1.1.1
    with the circuit breaker).

Of course, this is not acceptable for production.

So basically, I have millions of documents, but in my example I aggregate
within a single document containing around 100 nested documents with 10
fields, and it takes 2 GB of memory for the data cache and several seconds
to run.
My guess is that the filter does not help much and the aggregation runs over
all documents before filtering (rather than the other way around, as I
expected).

Is there any better solution for filter aggregation with nested documents?

Many thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4bf1cf1d-8f4b-41f1-add1-efa952691b64%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Binh Ly-2) #2

You are correct. Unfortunately the fielddata is loaded for all docs
regardless of filter condition. You can:

  1. Add more RAM

  2. Add more nodes (and shard your index out so that RAM usage will be
    distributed across multiple nodes)

  3. Use disk-based fielddata (fielddata will not be loaded into memory) for
    the field/s you are aggregating on. This will run slower and you have to
    reindex your data.
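
Option 3 can be sketched as a mapping change. This is a minimal sketch assuming the ES 1.x mapping syntax, in which a not_analyzed string field opts into doc values by setting fielddata.format to doc_values. The type name my_type and the helper build_doc_values_mapping are hypothetical, and the field layout (a raw sub-field under items.question_label) is inferred from the field name used in the question. The mapping is built in Python only so it can be printed and checked.

```python
import json


def build_doc_values_mapping(type_name="my_type"):
    """Build a mapping where items.question_label.raw uses doc values."""
    raw_subfield = {
        "type": "string",
        "index": "not_analyzed",  # keep the whole label as a single term
        # Assumed 1.x syntax: store fielddata on disk as doc values.
        "fielddata": {"format": "doc_values"},
    }
    return {
        "mappings": {
            type_name: {  # hypothetical type name
                "properties": {
                    "items": {
                        "type": "nested",
                        "properties": {
                            "question_label": {
                                "type": "string",
                                "fields": {"raw": raw_subfield},
                            }
                        },
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_doc_values_mapping(), indent=2))
```

The printed JSON would be used as the body when creating the new index, before reindexing the data into it.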



(Olivier B) #3

Thank you.
OK, that's what I feared: the cache is loaded regardless of the filter
condition. That's a shame: even if we filter heavily, targeting only one
document, the whole cache still has to be filled!
I will try adding a lot of RAM, see whether memory usage stabilizes, and
leave the cache as it is.
An alternative is to have many indexes, each acting as a pre-filter and
containing far less data.
Do you know whether the fielddata cache loads all docs, or only those of the
relevant shard? Would it help to have smaller shards?

On Monday, April 28, 2014 11:55:22 PM UTC+10, Binh Ly wrote:

You are correct. Unfortunately the fielddata is loaded for all docs
regardless of filter condition. You can:

  1. Add more RAM

  2. Add more nodes (and shard your index out so that RAM usage will be
    distributed across multiple nodes)

  3. Use disk-based fielddata (fielddata will not be loaded into memory) for
    the field/s you are aggregating on. This will run slower and you have to
    reindex your data.

http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/



(x0ne-2) #4

When fielddata is loaded, does it load only the field the aggregation
needs (items.question_label.raw in this case), or does it load the full
_source of every match and extract the field?


