Correct, mostly terms and sum aggregations.
Then the hot threads output should help. There are two phases when running a terms aggregation: building global ordinals (done once and then cached until the next refresh; this is especially costly on high-cardinality fields) and collecting matching documents. The hot threads output should help figure out which of them is the bottleneck.
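For anyone following along, hot threads can be captured with the nodes hot threads API; the parameters below (thread count, sampling interval, CPU type) are optional and shown only as an illustration:

```
GET /_nodes/hot_threads?threads=10&interval=500ms&type=cpu
```

Taking several snapshots while the aggregation is running gives a better picture than a single sample.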
That is because Mike bought a new computer that has more, but slower, cores. See the annotation (X).
We are feeling this agg performance pain on ES 2.1.1 too. Reindexing with doc_values disabled didn't help. Unfortunately, I guess we're going back to 1.7.4 until the performance on 2.x reaches parity...
I am curious if you followed Adrien's advice on how to debug the issue by outputting the hot threads. I am currently working on a new aggregation-heavy project, but I do not have a 1.x cluster to do comparisons.
A little. We saw that aggregations like the one below (building global ordinals) were hot, and the vast majority of the extra time in 2.1.1 vs. 1.5.2 was in our aggregations, although queries alone were somewhat slower too (~10%), possibly due to known issues like Elasticsearch 2.0's slower query execution compared to 1.3 (the fix for which appears far off: unreleased Lucene 5.5 vs. the current 5.3.1, unless we modify all our queries manually). The performance regressions we're seeing occur in spite of our new cluster using much faster local SSDs and having no ambient load, vs. our production reference instance. Waiting for the Elasticsearch team to restore performance parity in a future version seems like the most prudent choice.
Our aggs consist of lots of terms aggregations, some with long-array excludes, and a smattering of filter, match_all, nested, reverse_nested, date_histogram, and stats.
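As a rough illustration of that shape (all field names, index names, and exclude values here are hypothetical, not taken from our actual queries), such a request might look like:

```
POST /myindex/_search
{
  "size": 0,
  "aggs": {
    "by_tag": {
      "terms": {
        "field": "tag_id",
        "exclude": [0, -1]
      },
      "aggs": {
        "per_day": {
          "date_histogram": { "field": "created_at", "interval": "day" },
          "aggs": {
            "price_stats": { "stats": { "field": "price" } }
          }
        }
      }
    }
  }
}
```

Nesting a date_histogram and stats under a high-cardinality terms aggregation like this is exactly the pattern that leans on global ordinals.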
Example hot agg trace:
```
org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.getLeafCollector(GlobalOrdinalsStringTermsAggregator.java:94)
org.elasticsearch.search.aggregations.AggregatorBase.getLeafCollector(AggregatorBase.java:132)
org.elasticsearch.search.aggregations.AggregatorFactory$1$1.collect(AggregatorFactory.java:204)
org.elasticsearch.search.aggregations.LeafBucketCollector$3.collect(LeafBucketCollector.java:73)
org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectExistingBucket(BucketsAggregator.java:80)
org.elasticsearch.search.aggregations.bucket.BucketsAggregator.collectBucket(BucketsAggregator.java:72)
org.elasticsearch.search.aggregations.bucket.terms.LongTermsAggregator$1.collect(LongTermsAggregator.java:98)
org.elasticsearch.search.aggregations.AggregatorFactory$1$1.collect(AggregatorFactory.java:208)
org.elasticsearch.search.aggregations.LeafBucketCollector$3.collect(LeafBucketCollector.java:73)
```
Will report back on whether 1.7.4 fixes the performance problems we're seeing with our new cluster, probably Monday.
Ok, it was a little painful (https://github.com/elastic/elasticsearch-cloud-gce/issues/54#issuecomment-168580770) but I created a 1.7.4 cluster with everything else identical (except using GCE plugin 2.5.0), and performance for my aggregations is dramatically better. My production reference cluster answers our aggregation-heavy query at roughly 1.2-1.5 queries/second (while also serving customers!); the bigger (5 node vs 7 node) cluster with local instead of attached SSDs that I initially tested ES 2 on got about 0.8 queries/second, but 1.7.4 on the same setup gets 2.6 queries/second.
We also downgraded to 1.7.4 on our new cluster and re-indexed.
Before the downgrade, a typical aggregation heavy query would take around 20 seconds to complete.
After the downgrade, the same query takes around 5.5 seconds.
So we are talking about HUGE performance differences between Elasticsearch v1 and v2 for aggregation-heavy queries. It's Elasticsearch 1.7.4 for now, at least until v2 is dramatically improved.
For those who see much better performance with Elasticsearch 1.7: could you provide a full hot threads output (taken on 2.x while the aggs are running) so that we can get a better idea of where the CPU goes?
The forum won't let me upload a txt file, or post such a big reply.
So you can find the link to download the txt here:
Thanks for helping. These hot threads are pointing to fielddata loading. If this is really the problem you are having, then it should only be an issue for the first requests (those that have to pay the price of fielddata loading). This code hasn't changed much since 1.7, so I would like to confirm this is the actual cause of the problem. Does the response time become acceptable if you run the request several times (say, 10 times) in a row?
No, that's not it I'm afraid. We run the same query over and over, and yes, results come back much faster after the first run, but still much slower than in 1.7.4. When I say it takes X seconds on 2.1.1 and Y seconds on 1.7.4, I always mean after we have run it several times.
An interesting difference I am noticing is that when running the query on 2.1.1, it creates 1GB worth of fielddata and 0MB worth of filter cache. When I run the same query on 1.7.4 it creates 1.7GB worth of fielddata and 450MB worth of filter cache. So maybe something has changed in the code since 1.7?
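For anyone wanting to reproduce that comparison: those cache sizes can be read from the indices stats API. Note that 2.x renamed the filter cache to the query cache, so the stat name differs between the two versions (index name and exact metrics here are just a sketch):

```
GET /_stats/fielddata,query_cache     # on 2.x
GET /_stats/fielddata,filter_cache    # on 1.x
```

The difference you observed (no filter/query cache usage at all on 2.1.1) does look like a meaningful behavioral change rather than a measurement artifact.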
Could you share your request and try to capture hot threads after fielddata has been loaded already to see where CPU goes in that case?
Unfortunately I've since torn down that cluster, but we tested by running our most common agg-heavy query hundreds of times against each configuration and came to the same conclusions as @symos.
OK, how about this one:
I've run the query and taken 10 "snapshots" of the hot threads every 1-2 seconds (the query takes around 17 seconds to finish). So this will give you a better idea of where the CPU goes.
Bear in mind the same query on version 1.x takes around 3.5 seconds.
I can also send you the request privately if you need it.
Are you overriding the index.store.type setting by any chance? I'm surprised that it seems to use niofs while I would expect default_fs. I don't expect it to be the root cause of the problem, but it might contribute.
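For context, index.store.type is a static per-index setting, so it would have to come either from elasticsearch.yml or from the index creation request. A sketch of where it could have been overridden (index name hypothetical):

```
PUT /myindex
{
  "settings": {
    "index.store.type": "niofs"
  }
}
```

If no such override exists anywhere, the cluster should fall back to the default store, which on 2.x is expected to be default_fs (a hybrid of mmapfs and niofs).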
I may have found the reason: https://github.com/elastic/elasticsearch/pull/15998