Optimizations for nested aggregations

I'm doing some benchmarking of aggregations on a dataset of about 50 million
documents. It's quite a complex aggregation, nesting 5 levels deep: at the top
level is a terms aggregation, and the deeper levels combine terms, stats and
percentiles aggregations. With the number of shards set to 15 and 20 GB of
heap on a 3-node cluster in EC2 (m3.2xlarge with provisioned-IOPS EBS), the
aggregation takes about 90 seconds and consumes quite a bit of memory and CPU
(about 50%). This experiment leads me to some questions (I've included a
simplified sketch of the aggregation request below the questions):

  • Will setting routing according to the terms in the top-level aggregation
    have any impact? I don't know how the aggregations work internally, but my
    assumption is that this would ensure all data for the sub-aggregations is
    on the same node.
  • Does the aggregation operate on the entire document, or does it only load
    the fields that are aggregated?
  • Moving from a local dev environment with one node to a three-node cluster
    didn't yield as much improvement as I expected. Am I hitting any "limits"
    with regards to aggregations of this complexity?
  • Are there any other optimizations that could affect the performance?
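
To give an idea of the shape of the request, it is structured roughly like
this (field and aggregation names are made up for illustration, and only three
of the five levels are shown):

```
{
  "aggs": {
    "level_1": {
      "terms": { "field": "field_a" },
      "aggs": {
        "level_2": {
          "terms": { "field": "field_b" },
          "aggs": {
            "metric_stats":       { "stats":       { "field": "metric_x" } },
            "metric_percentiles": { "percentiles": { "field": "metric_x" } }
          }
        }
      }
    }
  }
}
```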


On Fri, May 23, 2014 at 10:50 AM, nilsga@gmail.com wrote:


  • Will setting routing according to the terms in the top-level aggregation
    have any impact? I don't know how the aggregations work internally, but my
    assumption is that this would ensure all data for the sub-aggregations is
    on the same node.

What is certain is that it will help with accuracy, since you will no longer
suffer from this limitation of the terms aggregation:
https://github.com/elasticsearch/elasticsearch/issues/1305 (the issue is
about facets, but aggregations work in a similar way).

Regarding performance, it should help, since each shard will have 15x fewer
terms to take care of. I say "should" because, depending on the cardinality
and distribution of the field in your top-level terms aggregation, this might
leave your shards less evenly balanced, and the shard with the most documents
could become a bottleneck.
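
To make that concrete, here is a minimal sketch of index-time routing over
HTTP (index, type and field names are made up; if I remember correctly, the
Java API exposes the same thing via setRouting on the request builders).
Routing by the field used in the top-level terms aggregation ensures that all
documents sharing a term land on the same shard:

```
# route each document by the value of the top-level terms field
curl -XPUT 'localhost:9200/myindex/event/1?routing=customer_42' -d '{
  "customer": "customer_42",
  "response_time": 123
}'
```

The aggregation request itself still goes to all shards, but each shard now
only owns a subset of the terms, which is where the accuracy and per-shard
term-count benefits come from.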

  • Does the aggregation operate on the entire document, or does it only load
    the fields that are aggregated?

Only fields that are aggregated.

  • Moving from a local dev environment with one node to a three-node cluster
    didn't yield as much improvement as I expected. Am I hitting any "limits"
    with regards to aggregations of this complexity?
  • Are there any other optimizations that could affect the performance?

Regarding these last two questions, I would recommend trying an upgrade to
Elasticsearch 1.2.0 to see if it makes things better. We have put effort into
improving the memory footprint of multi-level aggregations (which might in
turn improve performance) as well as the performance of the terms aggregation
thanks to global ordinals. See the 1.2.0 release notes for more information.

Since you mentioned percentiles, I'd like to point out that, although useful,
they tend to be costly to compute. You can try playing with the compression
parameter of this aggregation (see the percentiles aggregation documentation)
to see if it helps performance: it will make the aggregation faster and more
memory-efficient at the cost of some accuracy loss. The default value is 100,
so you can try e.g. 10 to see if it yields any performance improvement.
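
As a rough sketch (the field name is just an example; in 1.x the compression
is, as far as I remember, set directly on the percentiles aggregation):

```
{
  "aggs": {
    "response_time_percentiles": {
      "percentiles": {
        "field": "response_time",
        "compression": 10
      }
    }
  }
}
```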

--
Adrien Grand


Thank you for your reply.

Here are some observations from a couple of days of testing:

  • Setting up routing manually reduced the aggregation time by about 40%!
  • ... however, manual routing caused the data to be distributed unevenly. I
    assume we could have taken steps to improve the distribution, but we didn't
    investigate any further.
  • Upgrading from 1.1.0 to 1.2.0 didn't seem to improve either speed or
    memory usage, although we didn't do any accurate measurements of RAM usage.
  • Changing the compression value for percentiles did indeed have an effect.
  • Increasing the number of nodes from 3 to 5 didn't seem to improve
    performance.

Since adding additional nodes didn't seem to improve the performance, there
seems to be a bottleneck somewhere. The result of the aggregation is very
large (as JSON, it comes to about a million lines of text), so maybe data
transfer or constructing the result could be the bottleneck?


Massive JSON responses could indeed be a problem. I think you can easily see
whether CPU, disk, or network is the bottleneck using really any monitoring
tool. Even dstat --cpu --mem --disk --net will give you an idea. :)
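
For example, running something like this on each data node while the
aggregation executes (5-second samples) should make it fairly obvious which
resource is saturated:

```
dstat --cpu --mem --disk --net 5
```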

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


We're using the Java API. I assume that it is using a binary representation
of some kind that is more compact than JSON. I just mentioned JSON to
illustrate the size of the response. I'll certainly try to monitor
disk/network activity.
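
One way I can get a rough feel for the raw payload size, independent of the
Java client, is to run the same aggregation over the HTTP API and measure the
response (the index name and the aggregation file are just placeholders here):

```
# aggregation.json would contain the aggregation request body
curl -s -XGET 'localhost:9200/myindex/_search?search_type=count' -d @aggregation.json | wc -c
```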

Nils-H
