I'm trying to understand why this aggregation query is slow. I asked about it on IRC, and it was suggested that I open it up to a wider audience. We're using Elasticsearch 1.6.0 (we have a lot of data to migrate during upgrades).
The exact query is shown below as a Gist along with its response. The query has a single nested aggregation. The time range covers 12 hours, and the query matches 187M documents.
Repeatedly running the query without the aggregations produces these latencies:
3.7s, 3.0s, 2.1s, 2.0s, 0.99s, 1.1s, 0.80s, 0.78s, 0.93s, 1.0s, 0.80s, 0.83s
Then running with just the top-level aggregation, we see these latencies:
4.3s, 4.3s, 4.5s, 3.1s, 4.6s, 3.4s, 3.5s, 3.5s, 3.1s, 3.3s
Finally, running the original query with the nested aggregation, we see even longer latencies:
9.0s, 7.5s, 5.1s, 4.8s, 5.2s, 6.5s, 5.1s, 5.6s, 5.4s, 5.8s, 5.4s
NB: All runs immediately followed each other: no aggs, then one agg, then both aggs.
From these measurements, with warmed caches it seems like the query itself takes ~1s, the top-level aggregation takes an additional 2.5s, and the nested aggregation takes an additional 2s.
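For anyone who wants to check that arithmetic, here is a rough Python sketch using the latencies listed above. Treating the first four runs of each series as cache warm-up is my own assumption (the `WARMUP_RUNS` cutoff is arbitrary), and the median is just one reasonable way to summarize the rest.

```python
from statistics import median

# Latencies in seconds, copied from the three series of runs above.
no_aggs = [3.7, 3.0, 2.1, 2.0, 0.99, 1.1, 0.80, 0.78, 0.93, 1.0, 0.80, 0.83]
top_level_agg = [4.3, 4.3, 4.5, 3.1, 4.6, 3.4, 3.5, 3.5, 3.1, 3.3]
nested_agg = [9.0, 7.5, 5.1, 4.8, 5.2, 6.5, 5.1, 5.6, 5.4, 5.8, 5.4]

# Assumption: the first few runs of each series are still warming caches.
WARMUP_RUNS = 4


def warmed_median(latencies, warmup=WARMUP_RUNS):
    """Median latency after discarding the first `warmup` runs."""
    return median(latencies[warmup:])


base = warmed_median(no_aggs)
top = warmed_median(top_level_agg)
nested = warmed_median(nested_agg)

print(f"query alone:                {base:.2f}s")
print(f"top-level aggregation adds: {top - base:.2f}s")
print(f"nested aggregation adds:    {nested - top:.2f}s")
```

With that cutoff the medians come out to about 0.9s, 3.5s, and 5.4s, which is where the ~1s, ~2.5s, and ~2s figures come from.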
As this is only one of nine similar queries on one of our Kibana dashboards, I can see how this isn't pleasant to use for our support team. I'm hoping one of you can help me figure out what's wrong here, or how it can be improved, or even a better tool for this type of usage.
Example Query and Response
This contains the query referenced above as well as the output for the nested aggregations case. For the top-level aggregations, I simply removed the inner "aggs" block. For the no-aggregation case, I deleted everything after '"size": 0' in the query.
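To make the three variants concrete, here is a small Python sketch of how they relate to each other. The query body is a made-up stand-in, not the real one from the Gist: the field names (`@timestamp`, `host`, `status`) and aggregation names (`by_host`, `by_status`) are hypothetical. Only the shape (a query, `"size": 0`, and one aggregation nested inside another) follows the description above.

```python
import copy
import json

# Hypothetical stand-in for the real query; all field and aggregation
# names here are invented for illustration.
full_query = {
    "query": {"range": {"@timestamp": {"gte": "now-12h"}}},
    "size": 0,
    "aggs": {
        "by_host": {
            "terms": {"field": "host"},
            "aggs": {"by_status": {"terms": {"field": "status"}}},
        }
    },
}


def top_level_only(query):
    """Copy of the query with the inner "aggs" block removed."""
    q = copy.deepcopy(query)
    for agg in q.get("aggs", {}).values():
        agg.pop("aggs", None)
    return q


def without_aggs(query):
    """Copy of the query with every aggregation removed."""
    q = copy.deepcopy(query)
    q.pop("aggs", None)
    return q


for name, body in [
    ("nested aggregation", full_query),
    ("top-level aggregation", top_level_only(full_query)),
    ("no aggregations", without_aggs(full_query)),
]:
    print(name)
    print(json.dumps(body, indent=2))
```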
Mappings and Sample Documents
See mappings.json for the mappings for this index. The rest of the files are example documents for each type stored within this index.
From what everyone on IRC said, this doesn't appear to be too complex an aggregation query or too much data, so we're not sure why it's taking so long.
It looks like our CPUs may be too hot, likely from indexing. I'm leaning toward this being the source of the problem now but wanted to throw this out there for other ideas or ways to prove it.
Thanks in advance for your help,
-Cody