Control number of buckets created in an aggregation

This question was originally posted on StackOverflow:

I'm having trouble with controlling the number of buckets an aggregation is going to create.

This is a simple aggregation containing no nested or sub aggregations on an index having roughly 80000 documents:

GET /my_index/_search
{
   "size":0,
   "query":{
      "match_all":{}
   },
   "aggregations":{
      "unique":{
         "terms":{
            "field":"_id",
            "size":<NUM_TERM_BUCKETS>
         }
      }
   }
}

If I set the <NUM_TERM_BUCKETS> to 7000, I get this error response in ES 7.3:

{
   "error":{
      "root_cause":[
         {
            "type":"too_many_buckets_exception",
            "reason":"Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
            "max_buckets":10000
         }
      ],
      "type":"search_phase_execution_exception",
      "reason":"all shards failed",
      "phase":"query",
      "grouped":true,
      "failed_shards":[
         {
            "shard":0,
            "index":"my_index",
            "node":"XYZ",
            "reason":{
               "type":"too_many_buckets_exception",
               "reason":"Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
               "max_buckets":10000
            }
         }
      ]
   },
   "status":503
}

And it runs successfully if I decrease the <NUM_TERM_BUCKETS> to 6000.

I'm really confused. How on earth does this aggregation create more than 10,000 buckets?

To address accuracy issues in a distributed system, Elasticsearch asks each shard for a number of terms higher than `size` (see the `shard_size` setting).
If you have a lot of unique values, making multiple paged requests using the `composite` aggregation is probably a better way to go.
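As a sketch, paging with `composite` might look like the request below (the `unique`/`id` names are just placeholders; note that aggregating on `_id` is discouraged, so a dedicated keyword field would normally be the source):

GET /my_index/_search
{
   "size": 0,
   "aggregations": {
      "unique": {
         "composite": {
            "size": 1000,
            "sources": [
               { "id": { "terms": { "field": "_id" } } }
            ]
         }
      }
   }
}

Each response includes an `after_key`; passing it back as `after` in the next request pages through all buckets without ever building more than 1000 at once.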

Is this a documented behavior?

Here's the response of GET /my_index/_settings:

{
  "my_index" : {
    "settings" : {
      "index" : {
        "creation_date" : "1564578971559",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "KXlMMbZpT-yX8AaQK1FI7w",
        "version" : {
          "created" : "6080299",
          "upgraded" : "7030099"
        },
        "provided_name" : "my_index"
      }
    }
  }
}

i.e. my_index has only one shard.

Works OK on 6.x - I'll need to dig deeper into what changed in 7.x.
Looks like the shard_size multiplier is still in effect for a single-sharded system.
The good news is if you also set shard_size to 8000 it should work OK (does for me here).
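The numbers also line up with the terms aggregation's default of `shard_size = size * 1.5 + 10`. A bit of arithmetic (plain Python, just to illustrate the documented default) shows why 7000 trips the 10000-bucket limit while 6000 and an explicit `shard_size: 8000` do not:

```python
def default_shard_size(size):
    """Default shard_size for a terms aggregation in ES: size * 1.5 + 10."""
    return int(size * 1.5 + 10)

MAX_BUCKETS = 10000  # default search.max_buckets in ES 7.x

# size=7000 -> each shard is asked for 10510 buckets, over the limit
print(default_shard_size(7000))  # 10510
# size=6000 -> each shard is asked for 9010 buckets, under the limit
print(default_shard_size(6000))  # 9010
# an explicit shard_size of 8000 overrides the default and stays under the limit
```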

FYI - not sure if your example with a "unique" agg on the "_id" field is representative, but if you just want to count docs, this will be much cheaper:

GET my_index/_count

Replying to myself - this shard_size = size optimisation for single-sharded systems changed when we introduced cross-cluster search. A cluster could no longer know the total number of shards in a request, so this RAM-saving optimisation was removed.


Thanks, that was helpful.