On Fri, May 23, 2014 at 10:50 AM, firstname.lastname@example.org wrote:
I'm doing some benchmarking on aggregations on a dataset of about
50.000.000 documents. It's quite a complex aggregation, nesting 5 levels
deep. At the top level is a terms aggregation, and on the nested levels
there is a combination of terms, stats and percentiles aggregations on
deeper levels. With number of shards set to 15 and 20GB of heap on a 3
cluster setup in EC2 (m3.2xlarge with IOPS provisioned EBS), the
aggregation takes about 90 seconds, with quite some memory and CPU (about
50%) consumption. Doing this experiment leads me to some questions:
- Will setting routing according to the terms in the top-level aggregation
have any impact? I don't know how the aggregations work, but the assumption
is that this will make sure all data for the sub aggregations are on the
What is sure is that it will help with accuracy since you won't suffer from
this limitation of the terms aggregation anymore:
https://github.com/elasticsearch/elasticsearch/issues/1305 (the issue is
for facets, but aggregations work in a similar way).
Regarding performance, it should help since each shard will have 15x fewer
terms to take care of. I said "should" because depending on the cardinality
and distribution of the field of your top terms aggregation, this might
cause your shards to be balanced worse, and the shard with the most
documents might be a bottleneck.
- Does the aggregation operate on the entire document, or does it only load
the fields that are aggregated?
Only fields that are aggregated.
- Moving from a local dev environment with one node to a three node
cluster didn't yield as much improvment as I expected. Am I hitting any
"limits" with regards to aggregations of this complexity?
- Are there any other optimizations that could affect the performance?
Regarding these two last questions, I would recommend to try upgrading to
Elasticsearch 1.2.0 and see if it makes things better. We have put efforts
into improving the memory footprint of multi-level aggregations (which
might in turn improve performance) as well as the performance of terms
aggregations thanks to global ordinals. See
http://www.elasticsearch.org/blog/elasticsearch-1-2-0-released/ for more
Since you mentionned percentiles, I'd like to mention that although useful
they tend to be costly to compute. You can try to play with the compression
of this aggregation to see if it helps performance: it will make the
aggregation faster and more memory-efficient at the cost of some accuracy
loss. Default value is 100, so you can try eg. 10 to see if it yields any
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to email@example.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j45miVN9qMYNKXr%3DoP8JT5dDOX22B1nBKP_MUFu2gi%2BLg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.