Nested aggregation kills Elasticsearch cluster

I have a 2-node Elasticsearch cluster (v2.3.1; I also tested against v2.3.4 and against a single-node ES instance, with the same behaviour in both cases).

Each node is on its own AWS instance (m3.2xlarge). ES_HEAP_SIZE is set to 15g (on a 30g machine).

We tried creating a Kibana visualization, which results in this query:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "MyDateField": {
                  "gte": 1314114051459,
                  "lte": 1471966851459,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "MyStringField",
        "size": 10,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "MyOtherStringField",
            "size": 50,
            "order": {
              "_count": "desc"
            }
          }
        }
      }
    }
  }
}

The visualization times out after 30 seconds, and the ES cluster becomes unresponsive.

The Elasticsearch logs show nothing except jvm.monitor warning messages, which suggest that the system is under heavy load.

Even after a good while, the system remains unresponsive. Trying to shut it down with `sudo service elasticsearch stop` doesn't work. In the end we have to kill the ES process on each node in turn and start it up again manually.

We can reproduce the problem with direct queries.

This query works:

 {
   "size": 0,
   "aggs": {
     "2": {
       "terms": {
         "field": "MyStringField",
         "size": 10,
         "order": {
           "_count": "desc"
         }
       }
     }
   }
 }

This query kills the cluster (note the nested agg):

 {
   "size": 0,
   "aggs": {
     "2": {
       "terms": {
         "field": "MyStringField",
         "size": 10,
         "order": {
           "_count": "desc"
         }
       },
       "aggs": {
         "3": {
           "terms": {
             "field": "MyOtherStringField",
             "size": 50,
             "order": {
               "_count": "desc"
             }
           }
         }
       }
     }
   }
 }

My questions are:

  • Why is this happening?
  • If the query is simply too heavy, are there no circuit breakers for this sort of thing?
  • How can I stop this from happening and improve reliability of our cluster?

Please note: We cannot upgrade to v2.3.5 because we need to use the knapsack plugin, which has yet to deliver a version for v2.3.5.

I would appreciate any help that anyone can offer. If you need any more information, please let me know and I'll happily provide it.

Thanks!

Check out the `collect_mode: breadth_first` setting for terms aggregations [1]. It is possible to include it in the "Advanced" section of Kibana.
Work is underway to try to figure out when to turn this policy on automatically, but it's not an easy problem, so for now you will have to do it manually.

[1] https://www.elastic.co/guide/en/elasticsearch/guide/current/_preventing_combinatorial_explosions.html
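
As a minimal sketch of what that looks like for the nested query from the question (same field names; only the `collect_mode` line is new), the setting goes on the outer terms aggregation, so that the inner aggregation is only computed for the top buckets that survive pruning. Whether this is enough to keep a given cluster healthy depends on the cardinality of the fields, so treat it as a starting point rather than a guaranteed fix:

```json
{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "MyStringField",
        "size": 10,
        "collect_mode": "breadth_first",
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "MyOtherStringField",
            "size": 50,
            "order": {
              "_count": "desc"
            }
          }
        }
      }
    }
  }
}
```

In Kibana, the equivalent is likely to be entering `{"collect_mode": "breadth_first"}` in the JSON Input box under the Advanced section of the outer terms aggregation; this assumes a Kibana 4 style visualization editor and has not been checked against the exact versions used in this thread.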


Thanks @Mark_Harwood! That was massively helpful.

Just wondering: Are there plans to add a circuit breaker as well as automatically switching on breadth_first? Is it even possible to do that?

Also, do you have any docs on how you turn this on in Kibana's Advanced section?

We do have a circuit breaker that accounts for as many things as we can reasonably measure, but it admittedly doesn't catch 100% of issues.

An initial change was recently committed to support choosing breadth_first automatically [1]. I'm sure there will still be cases where it makes the wrong decision, because it can't predict how selective a query might be, so end-user control will still be required.

An example of a "killer" query is one asking for the top 5 tags on Stack Overflow and the top 5 users responding to each of them.

[1] "Define good heuristics to use `collect_mode: breadth_first`", Issue #9825, elastic/elasticsearch on GitHub
