We're in the process of migrating servers and thought we'd also upgrade Elasticsearch in the process.
However, when installing Elasticsearch 2.1.0 on the new servers (which are much more powerful than the old ones, with our MySQL benchmarks running about 40% faster) we saw a huge decrease in aggregation performance (aggregations are between 50 and 100% slower).
Have there been any changes to configuration that might explain this change? I know that doc values are now the default (whereas before we only used them for some fields in our index) but that shouldn't have such a huge impact (according to documentation), should it?
When looking at the nightly benchmarks for Elasticsearch, I'm seeing that on 1st December a huge spike occurs which makes search performance (aggregations, term, phrase) about 70% slower, so maybe this is the culprit. However I'm not seeing anyone talking about it so maybe I'm misunderstanding something.
Before we start downgrading and comparing different versions, I thought I'd ask here in case anyone has any ideas that might help.
I suspect doc values might be the culprit. What kind of disks are you using?
We're using SSDs.
Doc values is a logical assumption; it's the first thing that came to my mind as well. It's just that according to the documentation the hit shouldn't be nearly as big.
The problem is, we can't really test the hypothesis without downgrading, as it seems that doc values is not just the default option in Elastic 2, it's the only option! I haven't found a way to revert back to field cache, as the documentation says that if you set doc values to false, this field will not have an index at all (and therefore won't be available for aggregation).
You should be able to disable doc values in the mapping, but I was not
aware that it might deem the field not eligible for aggregations:
I suspected doc values because whenever I read the same information about
it being only a small performance hit, they always assume SSDs are being
used. What about the density of the field? Dense or sparse? There have been
improvements with sparse doc values in the latest Lucene versions, but I do
not think they have made it to elasticsearch yet.
Yes, as you can see in the doc you linked to it says you can disable doc values "If you are
sure that you don’t need to sort or aggregate on a field, or access the field
value from a script".
How do you define the density of the field? Non-empty vs empty? All fields in the index have values, so in this sense, it is a very dense field. Unless you mean something different, like cardinality (in which case, the fields have a low cardinality).
I guess the only way to really test (and probably solve the problem) will be to downgrade, which is a shame. I'm wondering why field cache had to be completely removed, why couldn't doc values just become the default, with the option to change it back to field cache?
Doc values are not required to run aggregations in Elasticsearch 2.x; however, they will be required in 3.x (except for significant terms), so it's good practice to keep them enabled when you plan to aggregate on a field. Note that doc values can't be changed on a live mapping, so you would need to create a new index with doc values disabled in the mappings, and reindex all your documents.
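As a rough illustration of the "new index with doc values disabled" route, here is a minimal Python sketch that only builds the index-creation body. The type name `report`, the field names, and the choice of `not_analyzed` string fields are assumptions for the example, not something taken from the original poster's mapping.

```python
import json


def mapping_without_doc_values(type_name, field_names):
    """Build a 2.x-style index-creation body with doc values turned off.

    Every field is assumed (hypothetically) to be a not_analyzed string;
    adjust the types to match the real mapping.
    """
    properties = {
        name: {"type": "string", "index": "not_analyzed", "doc_values": False}
        for name in field_names
    }
    return {"mappings": {type_name: {"properties": properties}}}


# Hypothetical type and field names, for illustration only.
body = mapping_without_doc_values("report", ["user_id", "campaign_id"])
print(json.dumps(body, indent=2, sort_keys=True))
```

The printed JSON would be sent when creating the new index, after which the documents are reindexed into it.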
What kind of aggregations do you run? If you know of some simple aggregations that are reproducibly slower in 2.1 than in 1.6, that would be helpful.
Sometimes the hot threads API can help figure out what the bottleneck is: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html Just run the aggregation in a loop and run the hot threads API in order to see in what part of the code elasticsearch is spending time.
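A minimal sketch of that "run the aggregation in a loop and sample hot threads" procedure, in Python. The base URL `http://localhost:9200`, the sample count, and the `run_aggregation` callable are assumptions; `run_aggregation` stands for whatever code sends your slow request.

```python
import threading
import urllib.request


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """URL of the nodes hot threads API for the given cluster address."""
    return "{}/_nodes/hot_threads?threads={}&interval={}".format(
        base_url.rstrip("/"), threads, interval
    )


def fetch_text(url):
    """Plain HTTP GET returning the response body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def sample_hot_threads(run_aggregation, fetch=fetch_text,
                       base_url="http://localhost:9200", samples=3):
    """Keep the aggregation running in the background and collect
    `samples` hot threads snapshots while it is busy."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            run_aggregation()

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    try:
        return [fetch(hot_threads_url(base_url)) for _ in range(samples)]
    finally:
        stop.set()
        worker.join()
```

Passing `fetch` as a parameter is just a convenience so the sampling logic can be tried without a live cluster.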
They're not? Then the docs are quite misleading! I will try disabling them and reindexing then, see if that will speed things up.
The aggregations we run are like advanced analytics reports. We're summing up stats (like clicks, impressions, sales etc.) per user, per campaign, per creative etc., all nested within each other. A big report can take minutes to load, so a difference of up to 100% is significant. However, I don't think it's our specific usage that causes the problem, judging by the nightly benchmarks, which clearly show a 70% slowdown in aggregations after 1st December!
I'll also take a look at the hot threads as you suggest and report back if anything interesting arises.
I was just asking in order to know what kind of aggregations you are using. From your description, it seems that you are running
Correct, mostly terms and sum.
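For readers following along, a request of the kind described (terms aggregations nested within each other, with sums on the stats) might be built roughly like this. The field names and the `by_`/`total_` naming are hypothetical, and for brevity the sums are attached only at the innermost level.

```python
import json


def nested_terms_with_sums(group_fields, sum_fields, size=10):
    """Nest one terms aggregation per grouping field (outermost first);
    the innermost level carries one sum aggregation per stat field."""
    aggs = {"total_" + field: {"sum": {"field": field}} for field in sum_fields}
    for field in reversed(group_fields):
        aggs = {
            "by_" + field: {
                "terms": {"field": field, "size": size},
                "aggs": aggs,
            }
        }
    # "size": 0 asks for aggregation results only, without search hits.
    return {"size": 0, "aggs": aggs}


# Hypothetical field names, for illustration only.
request_body = nested_terms_with_sums(
    ["user_id", "campaign_id", "creative_id"], ["clicks", "impressions"]
)
print(json.dumps(request_body, indent=2, sort_keys=True))
```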
Then the hot threads output should help. There are two phases when running a terms aggregation: building global ordinals (done once then cached until the next refresh, especially costly on high-cardinality fields) and collecting matching documents. The hot threads should help figure out which one of them is the bottleneck.
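To make the two phases concrete, here is a deliberately simplified toy model in Python. It is not how Lucene or Elasticsearch implement this internally; it only shows why the first phase scales with the number of unique terms while the second scales with the number of matching documents. The segment contents are made up.

```python
def build_global_ordinals(segment_terms):
    """Phase 1 (toy): map each segment-local ordinal to a global ordinal.

    segment_terms holds one sorted list of unique terms per segment.
    The work grows with the number of unique terms, which is why
    high-cardinality fields are the expensive case.
    """
    global_terms = sorted({term for terms in segment_terms for term in terms})
    position = {term: i for i, term in enumerate(global_terms)}
    seg_to_global = [[position[term] for term in terms] for terms in segment_terms]
    return global_terms, seg_to_global


def collect_counts(global_terms, seg_to_global, segment_docs):
    """Phase 2 (toy): count matching documents per global ordinal.

    segment_docs holds, per segment, the segment-local ordinal of each
    matching document.
    """
    counts = [0] * len(global_terms)
    for seg_index, doc_ordinals in enumerate(segment_docs):
        for seg_ord in doc_ordinals:
            counts[seg_to_global[seg_index][seg_ord]] += 1
    return dict(zip(global_terms, counts))


segments = [["campaign_a", "campaign_c"], ["campaign_b", "campaign_c"]]
matching_docs = [[0, 1, 1], [0, 1]]
global_terms, mapping = build_global_ordinals(segments)
result = collect_counts(global_terms, mapping, matching_docs)
print(result)  # {'campaign_a': 1, 'campaign_b': 1, 'campaign_c': 3}
```

In this model the mapping from phase 1 can be reused across requests until the segments change, which corresponds to the "cached until the next refresh" behaviour mentioned above.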
That is because Mike bought a new computer that has more, but slower, cores. See the annotation (X).
We are feeling this agg performance pain on es 2.1.1 too. Reindexing with docvalues false didn't help. Unfortunately, I guess we're going to 1.7.4 until the performance on 2.x reaches parity...
I am curious whether you followed Adrien's advice on debugging the issue by
outputting the hot threads. I am currently working on a new aggregation-heavy
project, but I do not have a 1.x cluster to do comparisons.
A little - we saw that aggregations like the one below (building global ordinals) were hot, and the vast majority of the added time in 2.1.1 vs 1.5.2 was in our aggregations, although queries alone were somewhat slower too (~10%), possibly due to issues like the one reported in "Elastic 2.0 slower query execution speed as 1.3" (the fix for that appears to be a long way out, in the unreleased Lucene 5.5 vs the current 5.3.1, unless we attempt to modify all queries manually). The performance regressions we're seeing are in spite of our new cluster using much faster local SSDs and having no ambient load, unlike our production reference instance. Waiting for the Elasticsearch team to restore performance parity in future versions seems like the most prudent choice.
Our aggs consist of lots of terms, some with array-long-excludes, and a smattering of filter, match_all, nested, reverse_nested, date_histogram, and stats.
Example hot agg trace:
Will report back, probably Monday, on whether 1.7.4 fixes the performance problems we're seeing with our new cluster.
Ok, it was a little painful (https://github.com/elastic/elasticsearch-cloud-gce/issues/54#issuecomment-168580770) but I created a 1.7.4 cluster with identical everything else (except using GCE plugin 2.5.0) and performance for my aggregations is dramatically better. My reference production cluster answers our aggregation-heavy query at roughly 1.2-1.5 queries / second (while also serving customers!). The bigger (5 node vs 7 node) cluster with local instead of attached SSDs that I initially tested es 2 on got about 0.8 queries / second, but 1.7.4 on the same setup gets 2.6 queries / second.
We also downgraded to 1.7.4 on our new cluster and re-indexed.
Before the downgrade, a typical aggregation heavy query would take around 20 seconds to complete.
After the downgrade, the same query takes around 5.5 seconds.
So we are talking about HUGE performance differences between Elastic v1 and v2 for aggregation heavy queries. So it's elastic 1.7.4 for now, at least until v2 is dramatically improved.
For those who can see much better performance with elasticsearch 1.7, could you provide a whole hot threads output (taken on 2.x while aggs are running) so that we can get a better idea of where cpu goes?
The forum won't let me upload a txt file, or post such a big reply.
So you can find the link to download the txt here:
Thanks for helping. These hot threads are pointing to fielddata loading. If this is really the problem that you are having, then it should only be an issue for the first requests (those that have to pay the price for fielddata loading). This code hasn't changed much since 1.7, so I would like to confirm this is the actual cause of the problem. Does the response time become acceptable if you run the request several times (say, 10 times) in a row?
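One way to run that check is a small timing helper like the Python sketch below. `run_query` is a placeholder for whatever code sends the aggregation request; comparing the first run against the median of the later ones is just one reasonable way to separate the cold cost from the warm cost.

```python
import statistics
import time


def time_repeated(run_query, repetitions=10, clock=time.perf_counter):
    """Run the same request several times in a row and record each duration."""
    durations = []
    for _ in range(repetitions):
        start = clock()
        run_query()
        durations.append(clock() - start)
    return durations


def cold_and_warm(durations):
    """Return (first run, median of the remaining runs).

    A first run much slower than the rest would be consistent with a
    one-off loading cost; uniformly slow runs would point elsewhere.
    """
    if len(durations) < 2:
        raise ValueError("need at least two runs to compare cold and warm")
    return durations[0], statistics.median(durations[1:])
```

For example, `cold_and_warm(time_repeated(send_my_aggregation))` gives the two numbers to compare, where `send_my_aggregation` is your own (hypothetical) function issuing the request.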