Aggregations blowing up client node (OOM)

We have a daily index with about 30K unique devices.
From Kibana we request the unique counts per day, per version.
If the date range is large enough, the client node runs out of memory (OOM) and has to be restarted.

I saw that it is still not possible to have a request circuit breaker on a client node (see Improved Request Circuit Breaker).

The query that Kibana sends is basically an _msearch over a lot of indices. I guess that this is one part of the problem (requesting them all together).
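Roughly, the request looks like this (the index names and the trimmed-down search body here are only illustrative of the _msearch shape, not our exact request):

GET /_msearch
{"index": ["devices-2016.02.26", "devices-2016.02.27", "devices-2016.02.28"]}
{"size": 0, "query": {"match_all": {}}, "aggs": {"2": {"date_histogram": {"field": "ts", "interval": "1h"}}}}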

My question is: what options do I have to limit the memory usage for the aggregations?

Below is the query:

{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "query": {
                "query_string": {
                  "query": "Country:CZ",
                  "analyze_wildcard": true
                }
              }
            },
            {
              "range": {
                "ts": {
                  "gte": 1456483506632,
                  "lte": 1457088306632,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "ts",
        "interval": "1h",
        "time_zone": "GMT+0",
        "min_doc_count": 1,
        "extended_bounds": {
          "min": 1456483506626,
          "max": 1457088306626
        }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "version",
            "size": 5,
            "order": {
              "1": "desc"
            }
          },
          "aggs": {
            "1": {
              "cardinality": {
                "field": "deviceid",
                "precision_threshold": 30000
              }
            }
          }
        }
      }
    }
  }
}

To me it looks like you are requesting it per hour and not per day as you describe, which will probably result in a large number of buckets.
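For example, a sketch of the same date_histogram with a daily interval (the rest of the aggregation unchanged):

"date_histogram": {
  "field": "ts",
  "interval": "1d",
  "time_zone": "GMT+0",
  "min_doc_count": 1,
  "extended_bounds": {
    "min": 1456483506626,
    "max": 1457088306626
  }
}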

Yes, you are right. But even with '1d' as the interval the problem occurs; it just happens later.
Yes, limiting the number of buckets is of course the first possibility, but I was looking for options to make the whole thing more efficient, maybe something like execution_options.

As you are grouping by device ID and are using a cardinality aggregation with high precision, have you considered using routing based on the device ID when indexing in order to get all records related to a single device ID located in a single shard?
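For example, something like this at index time (the index name, type and document values are made up for illustration; the important part is the routing parameter):

PUT /devices-2016.03.04/event/1?routing=ABC123
{
  "deviceid": "ABC123",
  "version": "1.2.3",
  "Country": "CZ",
  "ts": 1457088306632
}

With this, all documents for deviceid ABC123 end up on the same shard of that daily index.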

Yes, I did. But I didn't implement it, because I wasn't sure whether it would help.
Is ES smart enough to 'know' that it can simply add the counts because the device IDs are unique per shard?

How many versions are there? How many do you typically see in a single day?

Sorry, I just saw that the OOM is happening on the client node, not the data nodes. Instead of my previous question, could you tell us how many shards you are searching across, and how many shards per daily index?

Also, in your original post you said that the OOM occurs "if the date range is large enough".

How big does the date range need to be before you see these OOM errors? In the query above you are searching over 7 days; does that already cause the OOM?

Note: The following is written assuming you are hitting OOM with the exact query you provided, and without any knowledge of your ES version, current heap size, number of nodes, number of shards per daily index, or the range of dates you are searching over. I have tried to show as much of my working as possible so you can adapt things if some or all of my assumptions are not true.

To explain a little of why this aggregation might be taking up so much memory, the following is a rough calculation of the memory the cardinality aggregation will require (note that the terms and date_histogram aggregations also have a memory footprint, but I am ignoring that for now):

Memory used by the cardinality aggregation for a single bucket[1]: 8 * precision_threshold bytes = 8 * 30000 bytes = 240,000 bytes = 234.375 kB

The histogram aggregation is asking for D days of data and you are returning 5 versions for each day, so the memory required to calculate the cardinality across all the buckets is:

D * 5 * 234.375 kB ≈ D * 1.144 MB

This is the memory footprint of the result of the cardinality aggregation returned by each shard. So if you have S shards per daily index, the client node will need:

(S + 1) * D * 1.144 MB

Note: Since your indices are daily each shard should only return a single date histogram bucket to the client node. The extra 1 is for the resulting data structure after the shard results are reduced into the final result.

So if S is 5 (the default) and D is 365, the cardinality aggregation will require (5 + 1) * 365 * 1.144 MB ≈ 2.4GB of memory across all the buckets. If you have a small heap size set on your nodes (e.g. 4GB) I could see this causing an OOM, especially if these queries are being run concurrently by different users.

The terms aggregation does have a collect_mode option and an execution_hint option which are designed to reduce the memory usage in particular situations but unfortunately they will not help here as they are options for reducing memory usage on the shard collection stage and not the reduce stage.
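For reference, this is roughly how those options are set on a terms aggregation (the values here are only illustrative, and as noted they only affect the shard collection stage):

"terms": {
  "field": "version",
  "size": 5,
  "collect_mode": "breadth_first",
  "execution_hint": "map"
}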

The options I see are to do one (or more) of the following:

  • Limit the number of buckets that can be produced by the request, e.g. restrict your requests to fewer days' worth of data.
  • Limit the memory usage of the cardinality aggregation in each individual bucket by reducing the precision_threshold (see the sketch after this list). This will use less memory per bucket at the cost of some precision.
  • Increase ES_HEAP_SIZE on your nodes to accommodate the memory pressure. Note that if you are already running nodes with 30GB heaps, you will probably need to start more client nodes and spread the incoming requests among them.
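As a sketch of the second option, assuming a precision of 3000 is acceptable for your use case, the cardinality aggregation would become:

"1": {
  "cardinality": {
    "field": "deviceid",
    "precision_threshold": 3000
  }
}

That drops the per-bucket footprint from roughly 234 kB to roughly 23 kB (8 * 3000 bytes), i.e. about a tenth of the memory in the calculation above.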

Hope that helps

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate

Hi Colin,

Thanks for your detailed answer.
We are running ES2.10, and each daily index has 6 shards.
The client node runs with a 15GB heap.

Some thoughts:

  1. I can see that this could indeed result in an OOM.
    However, I expect that this is also caused by Kibana doing the _msearch all at once. In this way you force ES to combine all the individual answers from the shards, while I guess that the end result Kibana actually uses for the display is relatively limited.

  2. For problems like these a circuit breaker is really necessary. The client node often ends up irrecoverably damaged and needs to be restarted.
    I currently have no way to limit users in the intervals/number of days they request.

  3. Question: Christian mentioned routing on deviceid. I did actually consider routing, but I hesitated to implement it.
    ES could be smart about combining aggregations when it knows the routing in advance.
    What is your opinion on this?

Kind regards,
Peter

Sorry it's taken a while to reply; I've had some holiday recently and have only just picked this up.

We would like to add a circuit breaker for aggregation buckets created during a request. However, determining how much memory a bucket will consume is non-trivial, and we have yet to find a method of calculating the cost of an aggregation bucket that works reliably and doesn't adversely affect the performance of aggregations. I know @dakrone has been thinking about a different approach to creating a circuit breaker for this too. However, the majority of this work is currently focused on adding a circuit breaker to the shard-level collection phase, since this is where we see the majority of OOM exceptions with aggregations.

It could, and this might be worth exploring. The trick would be knowing that you are aggregating on the routing key so that this optimisation can be applied, since ES doesn't store what you are routing on (hence why you have to supply a routing key on requests if you are using custom routing).

Hope this helps, and sorry again for the long delay before replying