Troubleshooting Slow Aggregation Query

I'm trying to understand why this aggregation query is slow. I've asked about it in IRC and it was suggested to open it up to a wider audience. Using Elasticsearch 1.6.0. (Have a lot of data to migrate during upgrades.)

The exact query is shown below as a Gist along with its response. The query has a single nested aggregation. The time range covers 12 hours, and the query matches about 187M documents.

Repeatedly running the query without the aggregations produces these latencies:
3.7s, 3.0s, 2.1s, 2.0s, 0.99s, 1.1s, 0.80s, 0.78s, 0.93s, 1.0s, 0.80s, 0.83s

Then running with just the top-level aggregation, we see these latencies:
4.3s, 4.3s, 4.5s, 3.1s, 4.6s, 3.4s, 3.5s, 3.5s, 3.1s, 3.3s

Finally, running the original query with the nested aggregation, we see even longer latencies:
9.0s, 7.5s, 5.1s, 4.8s, 5.2s, 6.5s, 5.1s, 5.6s, 5.4s, 5.8s, 5.4s

NB: All runs were immediately following each other. No aggs, one agg, and then both aggs.

From these measurements, with warmed caches it seems like the query itself takes ~1s, the top-level aggregation adds ~2.5s, and the nested aggregation adds another ~2s.

As this is only one of nine similar queries on one of our Kibana dashboards, I can see how this isn't pleasant to use for our support team. I'm hoping one of you can help me figure out what's wrong here, or how it can be improved, or even a better tool for this type of usage.

Example Query and Response

This contains the query referenced above as well as the output for the nested-aggregation case. For the top-level-aggregation case, I simply removed the inner "aggs" block. For the no-aggregation case, I deleted everything after '"size": 0' in the query.
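For readers without access to the Gist, the query has roughly this shape (the field names, agg names, interval, and time bounds below are illustrative placeholders, not the real values from the Gist):

    {
      "size": 0,
      "query": {
        "filtered": {
          "query": { "query_string": { "query": "*" } },
          "filter": {
            "range": {
              "@timestamp": { "gte": "now-12h", "lte": "now" }
            }
          }
        }
      },
      "aggs": {
        "by_host": {
          "terms": { "field": "host" },
          "aggs": {
            "over_time": {
              "date_histogram": { "field": "@timestamp", "interval": "1h" }
            }
          }
        }
      }
    }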

Mappings and Sample Documents

See mappings.json for the mappings for this index. The rest of the files are example documents for each type stored within this index.

From what everyone in IRC said, this doesn't appear to be too complex of an aggregation query or too much data, so we're not sure why it's taking so long.

It looks like our CPUs may be too hot, likely from indexing. I'm leaning toward this being the source of the problem now but wanted to throw this out there for other ideas or ways to prove it.

Thanks in advance for your help,

-Cody

I don't think there is any issue here: it just takes time to compute the aggregates on ~200M documents. If you want to speed it up, I think the most efficient option would be to run this query on a shorter range. For instance, right now this is running aggs on the last 12 hours of data. You could change the defaults to run it only on the last 6 hours of data, or even just the last 3.

Otherwise I'm afraid the only option would be to buy beefier hardware.

I find this surprising as we're running on pretty beefy hardware as-is.

8 tier1 nodes: 12x Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz, 62GB RAM, 1TB SSD
8 tier2 nodes: 8x Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 62GB RAM, 5TB SSD
1 tier3 node: 12x Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz, 46GB RAM, 37TB HDD

This query would only be reading from one shard on each of the 8 tier1 nodes (each index has 8 shards), since it's only querying the past 12 hours. We had almost no querying at the time I ran this, but still had lots of indexing activity (hence my concern about the CPU usage). My other theory is that it may be related to the sparsity of my data. (This index contains six types, but about 77% of the docs are from one type.)

-Cody

Your query_string does not seem to be doing much, so first of all I'd probably start by rewriting the query as something like this:
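(A sketch, assuming your query_string is effectively a match-all; @timestamp here stands in for your actual time field, and your aggs section stays the same.)

    {
      "size": 0,
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "range": {
              "@timestamp": { "gte": "now-12h", "lte": "now" }
            }
          }
        }
      }
    }

A match_all inside a filtered query skips query-string parsing entirely.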

Next, I'd look at your filter section. Where are these date ranges coming from? Are they looking at "now - 12h", for example? If so, we can do something like the following to make better use of caching (I'm not sure what part of your query is cached now, since it's just an arbitrary date range in the filter and you didn't explicitly set _cache to true there). An example of this might be:
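(Again a sketch; @timestamp is a placeholder for your time field.)

    {
      "size": 0,
      "query": {
        "filtered": {
          "query": { "match_all": {} },
          "filter": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": { "gte": "now-12h/h" },
                    "_cache": true
                  }
                }
              ]
            }
          }
        }
      }
    }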

This makes the range filter use now minus 12 hours rounded down to the hour, which can be cached and reused across requests. If you want to filter down to exactly 12 hours from now, just add one more must clause after it with the un-rounded now-12h. The first, cacheable filter should still make it faster.
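The filter section would then look something like this (still a sketch with a placeholder field name):

    {
      "bool": {
        "must": [
          { "range": { "@timestamp": { "gte": "now-12h/h" }, "_cache": true } },
          { "range": { "@timestamp": { "gte": "now-12h" } } }
        ]
      }
    }

The first clause only changes once an hour, so it can be served from the filter cache; the second narrows the result down to the exact window.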

I hope this gives some ideas.

Sarwar

Thanks for the suggestions. Unfortunately, running your last example still takes 5s, and up to 18s, to execute. Still trying to figure out what the root issues are here... (Working on reindexing an index from prod to test different mapping theories out now.)

Maybe this helps. This is a snapshot of our hot_threads during the day. Almost all the hottest threads are Lucene Merges. Does this look normal?
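For reference, I captured the snapshot with something along these lines (the host is a placeholder for one of our nodes):

    curl -s 'localhost:9200/_nodes/hot_threads?threads=10'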

This is back to theory #1: I'm thinking the aggregations are slow because we're constrained on CPU.

Thoughts?

It is a possibility that merges are being throttled and you have lots of small segments. Do you have any idea of how big your shards and segments are?
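You can get a rough idea from the cat APIs, with something like this (the host is a placeholder):

    curl -s 'localhost:9200/_cat/shards?v'
    curl -s 'localhost:9200/_cat/segments?v'

_cat/shards shows the on-disk size of each shard, and _cat/segments lists each segment with its size and document count, which should make it obvious whether you have lots of small segments.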