Why doesn't ES stop my aggregation instead of just crashing?

Hello, I am learning Elasticsearch basics and I am dealing with an Out of Memory error when performing aggregations with a large number of buckets.

I already know that for aggregations with 10,000+ buckets I should use a composite aggregation, but sometimes this cannot be done (e.g. queries auto-generated by Grafana). I don't understand why ES allows me to run a query that crashes it instead of stopping me beforehand.

I crafted a simple example.

I create a foo_index index with a single document:

POST foo_index/foo_type/1
{
  "ts": "2018-10-20T10:00:00Z",
  "value": 10
}

Then I perform a very heavy aggregation query on it:

{
 "query": {
   "bool": {
     "filter": {
       "range": {
         "ts": {
           "gte": "1980-01-01T00:00:00Z",
           "lte": "2019-01-01T00:00:00Z"
         }
       }
     }
   }
 },
 "aggs": {
   "by_ts": {
     "date_histogram": {
       "field": "ts",
       "interval": "10s",
       "extended_bounds": {
         "min": "1980-01-01T00:00:00Z",
         "max": "2020-01-01T00:00:00Z"
       }
     },
     "aggs": {
       "avg_value": {
         "avg": {
           "field": "value"
         }
       }
     }
   }
 }
}

After a few seconds, the JVM starts heavy garbage collection:

[2018-09-11T17:44:59,949][WARN ][o.e.m.j.JvmGcMonitorService] [AmQ_BYj] [gc][70]
 overhead, spent [2.4s] collecting in the last [2.6s]
[2018-09-11T17:45:15,822][WARN ][o.e.m.j.JvmGcMonitorService] [AmQ_BYj] [gc][71]
 overhead, spent [14.2s] collecting in the last [15.8s]

And after a while, it crashes with a Java heap OOM.

Can anybody explain to me why ES does not protect itself from this situation, for instance by using a circuit breaker?

Edit: I tried ES 6.4.0 (Windows exe and Linux Docker) and ES 6.3.1 (Linux Docker), with the same results.

Hi,
We introduced a new cluster setting called search.max_buckets in 6.x. It is disabled by default in this version and will default to 10,000 in the next major version (v7):
https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-bucket.html
So in 6.x you can set it manually in your cluster in order to protect against these killer queries. It is not set by default in 6.x because we considered it a breaking change that requires a new major version to be introduced. However, we do issue a deprecation warning in the logs if any aggregation reaches the 10,000 limit in 6.x, and the message explicitly links to the new setting.
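
For example, you can apply it dynamically through the cluster settings API. The 10,000 value below is just a suggestion matching the future 7.x default; pick whatever limit fits your cluster:

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 10000
  }
}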


Hi Jimczi, thank you for the quick response.

I tried setting the limit and now everything works as intended. When dealing with killer queries, the server throws an exception like the following:

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "",
    "phase": "fetch",
    "grouped": true,
    "failed_shards": [],
    "caused_by": {
      "type": "too_many_buckets_exception",
      "reason": "Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
      "max_buckets": 10000
    }
  },
  "status": 503
}

Does the search.max_buckets apply to composite aggregation too?

Does the search.max_buckets apply to composite aggregation too?

Yes, but you can paginate the composite aggregation, so the limit should not be a problem. You can retrieve 10,000 composite buckets and then use the after option to retrieve the next page of buckets.
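
Something along these lines (a minimal sketch reusing the foo_index example above; the field names, the 10s interval and the page size are just taken from your query and are not required values):

GET foo_index/_search
{
  "size": 0,
  "aggs": {
    "by_ts": {
      "composite": {
        "size": 10000,
        "sources": [
          { "ts": { "date_histogram": { "field": "ts", "interval": "10s" } } }
        ]
      },
      "aggs": {
        "avg_value": {
          "avg": { "field": "value" }
        }
      }
    }
  }
}

Each response contains an after_key object; pass it back as the "after" option of the composite aggregation in the next request to retrieve the following page of buckets.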

Great, that's exactly the way we do it. Thank you very much.
