Large buckets aggregations

Hi there,
We are testing Elasticsearch to compute large aggregations.

Our queries consist of an aggregation that returns a large number of buckets, which we then feed into other pipeline aggregations to get the results we need.

Our slowest queries run in ~10 seconds on my laptop. Can this be optimized in a production environment?

If so, what configuration parameters should I tweak, and what would be an appropriate server configuration for such a usage?

Here is a simplified example of one of our queries:

{
  "size": 0,    // skip query hits
  "query": { }, // some query, skipped for this example
  "aggs": {
    "my_large_agg": { // a large aggregation returning ~100k buckets; we are not really interested in the raw results but use them below in a sum_bucket aggregation
      "terms": {
        "field": "some_id",
        "size": 0
      },
      "aggs": {
        "my_large_agg_avg": {
          "avg": {
            "field": "some_field"
          }
        }
      }
    },
    "my_large_agg_sum": { // pipeline aggregation returning the result we want to return to the end user or use in other piped aggregations
      "sum_bucket": {
        "buckets_path": "my_large_agg.my_large_agg_avg"
      }
    }
  }
}

The problem with this approach is that it requires, at search time, that all ids are grouped together and sent back to the coordinating node to then be summed. This does not scale well in either memory usage or CPU, and you will probably find, as your data grows, that you run into bigger problems such as OOM errors on your nodes or circuit-breaker exceptions. In fact, we do not recommend using size: 0 with the terms aggregation at all because of these scaling problems.

Since this data needs to be grouped at some point, you might find it more beneficial to do the grouping for each id at index time and create a 'profile' for each 'entity' (id) which contains the average of 'some_field' (together with any other 'entity-centric' information) and is updated as events come in. You would then have your current 'event-centric' index and also an 'entity-centric' index with one document per value of 'some_id'. The result you require here would then be a simple and fast sum aggregation (see the sketch below).
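To make that concrete, here is a rough, non-authoritative sketch of what this could look like. The index name 'entity_profiles', the document id, and the field names 'sum', 'count' and 'avg' are made up for this example, and the exact update-endpoint path and script syntax vary between Elasticsearch versions. Each incoming event updates the profile document for its 'some_id' via a scripted upsert that maintains a running sum, count and average:

POST entity_profiles/_update/42   // 42 = this event's some_id (hypothetical)
{
  "script": {
    "source": "ctx._source.sum += params.value; ctx._source.count += 1; ctx._source.avg = ctx._source.sum / ctx._source.count",
    "params": { "value": 12.5 }   // this event's some_field value
  },
  "upsert": {                     // first event for this id creates the profile
    "sum": 12.5,
    "count": 1,
    "avg": 12.5
  }
}

The end-user result then becomes a single cheap aggregation over the entity-centric index, with no ~100k-bucket terms aggregation needed at search time:

{
  "size": 0,
  "aggs": {
    "sum_of_avgs": {
      "sum": {
        "field": "avg"
      }
    }
  }
}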

You can find more information on the idea of 'entity-centric' indexing here: https://www.elastic.co/videos/entity-centric-indexing-mark-harwood

Hope that helps

Thanks a lot for your detailed answer, Colin.
I am going to have a look at the 'entity-centric' indexing.