Troubleshooting Slow Aggregation Query

codyaray · January 6, 2016, 8:47pm

I'm trying to understand why this aggregation query is slow. I've asked about it in IRC and it was suggested to open it up to a wider audience. Using Elasticsearch 1.6.0. (Have a lot of data to migrate during upgrades.)

The exact query is shown below as a Gist along with its response. The query has a single nested aggregation. The time range covers 12 hours and 187M documents are returned from the query.

Repeatedly running the query without the aggregations produces these latencies:
3.7s, 3.0s, 2.1s, 2.0s, 0.99s, 1.1s, 0.80s, 0.78s, 0.93s, 1.0s, 0.80s, 0.83s

Then running with just the top-level aggregation, we see these latencies:
4.3s, 4.3s, 4.5s, 3.1s, 4.6s, 3.4s, 3.5s, 3.5s, 3.1s, 3.3s

Finally, running the original query with the nested aggregation, we see even longer latencies:
9.0s, 7.5s, 5.1s, 4.8s, 5.2s, 6.5s, 5.1s, 5.6s, 5.4s, 5.8s, 5.4s

NB: All runs were immediately following each other. No aggs, one agg, and then both aggs.

From these measurements, with warmed caches it seems like the query itself takes ~1s, the top-level aggregation takes an additional 2.5s, and the nested query takes an additional 2s.

As this is only one of nine similar queries on one of our Kibana dashboards, I can see how this isn't pleasant to use for our support team. I'm hoping one of you can help me figure out what's wrong here, or how it can be improved, or even a better tool for this type of usage.

Example Query and Response

This contains the query referenced above as well as the output for the nested aggregations case. For the top-level aggregations, I simply removed the inner "aggs" block. For the no-aggregation case, I deleted everything after '"size": 0' in the query.

gist.github.com

https://gist.github.com/codyaray/5d8d1c6766acd2134d07

query.sh

time curl "localhost:9200/stats-2016.01.05/_search?pretty" -d '{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": false,
          "query": "_type:nextgen-stats",
          "lowercase_expanded_terms": false
        }
      },

This file has been truncated. show original

response.json

{
  "took": 7589,
  "timed_out": false,
  "_shards": {
    "total": 8,
    "successful": 8,
    "failed": 0
  },
  "hits": {
    "total": 208644986,

This file has been truncated. show original

Mappings and Sample Documents

See mappings.json for the mappings for this index. The rest of the files are example documents for each type stored within this index.

gist.github.com

https://gist.github.com/codyaray/15e358a3fca302b610c2

Feedster.json

{
  "_index": "stats-2015.11.09",
  "_type": "Feedster",
  "_id": "AVDrFpcuL69rrYzM59Sx",
  "_score": 1,
  "_source": {
    "@timestamp": "2015-11-09T07:12:16.011Z",
    "line": "ABC_000",
    "host": "somehost01",
    "type": "Feedster",

This file has been truncated. show original

feed-latency.json

{
  "_index": "stats-2015.11.09",
  "_type": "feed-latency",
  "_id": "AVDsb9KPL69rrYzM9eE5",
  "_score": 1,
  "_source": {
    "app": "someapp",
    "cache-latency": 0,
    "@timestamp": "2015-11-09T13:29:22.269Z",
    "service": "some.fancy.service",

This file has been truncated. show original

logs.json

{
  "_index": "stats-2015.11.09",
  "_type": "logs",
  "_id": "AVDrh0xmL69rrYzMFNS9",
  "_score": 1,
  "_source": {
    "message": "service_name\":\"someservice\",\"num_events_received\":12345,\"host\":\"somehost01\",\"type\":\"nextgen-stats\",\"user\":\"codyaray\"}",
    "tags": [
      "_jsonparsefailure",
      "stats"

This file has been truncated. show original

There are more than three files. show original

From what everyone in IRC said, this doesn't appear to be too complex of an aggregation query or too much data, so we're not sure why its taking so long.

It looks like our CPUs may be too hot, likely from indexing. I'm leaning toward this being the source of the problem now but wanted to throw this out there for other ideas or ways to prove it.

Thanks in advance for your help,

-Cody

jpountz · January 6, 2016, 10:37pm

I don't think there is any issue here: it just takes time to compute the aggregates on 200M documents. If you want to speed it up, I think the most efficient option would be to run this query on a shorter range. For instance right now this is running aggs on the last 12 hours of data. You could change defaults to run it only on the last 6 hours of data or even only 3.

Otherwise I'm afraid the only option would be to buy beefier hardware.

codyaray · January 8, 2016, 6:51pm

I find this surprising as we're running on pretty beefy hardware as-is.

8 tier1 nodes with 12 Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz, 62GB RAM, and 1TB ssd
8 tier2 nodes with 8 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 62GB RAM, and 5TB ssd
1 tier3 node with 12 Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz, 46GB RAM, and 37TB hdd

This query would be reading from one shard on each of the 8 tier1 nodes (each index has 8 shards) only, since its only querying the past 12 hours. We had almost no querying at the time I made this query but still had lots of indexing action (hence my concern about the CPU usage). My other theory is that it may be related the sparsity of my data. (This index contains six types, but about 77% percent of the docs are from one type.)

-Cody

Sarwar · January 8, 2016, 7:51pm

Your query_string does not seem to be doing much so first of all, I'd probably start with just rewriting the query as:

gist.github.com

https://gist.github.com/sarwarbhuiyan/2ad8132f8e52b6d18708

gistfile1.txt

time curl "localhost:9200/stats-2016.01.05/nextgen-stats/_search?pretty" -d '{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {

This file has been truncated. show original

Next, I'd look at your filter section. Where are these date ranges coming from? Are they looking at "now - 12h"? for example? If so, we can do something like this to better make use of caching (I'm not sure what part of your query is cached now since it's just arbitrary date range in filter and you didn't explicitly set _cache to true there). An example of this might be:

gist.github.com

https://gist.github.com/sarwarbhuiyan/8b4c74ce7a7cbd9a2f56

gistfile1.txt

time curl "localhost:9200/stats-2016.01.05/nextgen-stats/_search?pretty" -d '{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "must": [
            {

This file has been truncated. show original

This would make the range filter cache now-12 hours rounded to the hour and that would be cached. If you wanted to filter down to exactly 12 hours from now, just need to add one more must clause after it which is now-12. The first cacheable filter will make it faster probably.

I hope this gives some ideas.

Sarwar

codyaray · January 11, 2016, 5:27pm

Thanks for the suggestions. Unfortunately, running your last example still takes 5s and up to 18s to execute. Still trying to figure out what the root issues are here... (Working on reindexing an index from prod to test different mapping theories out now.)

codyaray · January 11, 2016, 8:04pm

Maybe this helps. This is a snapshot of our hot_threads during the day. Almost all the hottest threads are Lucene Merges. Does this look normal?

gist.github.com

https://gist.github.com/codyaray/0fd390ae9f64f28d317e#file-hot_threads-txt-L88

hot_threads.txt

::: [svlchi6pesm03.peak6.net][0zQiOjCySx2lc2CjKKAuxQ][svlchi6pesm03][inet[/10.2.4.180:9300]]{data=false, box_type=mgmt, master=true}
   Hot threads at 2016-01-11T19:40:08.163Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (198.7micros out of 500ms) cpu usage by thread 'elasticsearch[svlchi6pesm03.peak6.net][keep_alive]'
     10/10 snapshots sharing following 3 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.marvel.agent.exporter.ESExporter$ConnectionKeepAliveWorker.run(ESExporter.java:788)
       java.lang.Thread.run(Thread.java:745)

::: [svlchi6pes02.peak6.net][7QVV3C62QeyeH9RySjm7Cg][svlchi6pes02][inet[/10.2.4.182:9300]]{box_type=slow, master=false}

This file has been truncated. show original

This is back to theory #1 - why I'm thinking that query aggregations are slow due to CPU resource constraint.

Thoughts?

Sarwar · January 14, 2016, 3:02pm

It is a possibility that merges are being throttled and you have lots of small segments. Do you have any idea of how big your shards and segments are?

Topic		Replies	Views
Nested aggregation slows query from ~400ms to 10s Elasticsearch	11	1472	October 8, 2018
Aggregation on nested document always takes 2-3 seconds? Elasticsearch	2	528	July 6, 2017
Slow aggregation queries, only after data change (ES 2.3) Elasticsearch	9	1326	December 26, 2016
Nested aggregation of 7 documents takes 1 min Elasticsearch	22	819	March 3, 2020
Extremely slow nested aggregations, need suggestions on modeling/shards Elasticsearch	1	483	July 5, 2017

Troubleshooting Slow Aggregation Query

Example Query and Response

Mappings and Sample Documents

Related topics