Aggregate query: Elasticsearch: java.lang.OutOfMemoryError: Java heap space

I have a single node on a 32GB RAM machine, running ES 5.6.2, which is collecting a reasonable volume of nginx logs. It's just about working well enough and we are in the process of moving over to a 3-node cluster with up-to-date software.

In the meantime, I have a single report that I ideally need to be able to run, doing a unique count aggregation across 30 days of data.

I can juuuuuuust about get it to run, usually by running it for 1 day, 7 days, 14 days, 21 days and finally 30 days (does that even make sense?)... but more often than not it still falls over with the classic OOM error.

Sometimes it just shows a timeout error, but normally it full-on crashes and I have to restart ES manually.

Extensive googling led me to increase the heap space from 2GB to 12GB - in fact, it was only after doing that that the report would run at all. I tried setting it to 50% of RAM, which is 16GB, but that seemed less stable, as I believe Logstash is also using plenty of RAM.

Honestly, I'm mostly frustrated at not finding any hints, docs or info about what to do about the OOM error other than increasing the heap space. What's next? What should I be reading? What debugging can I do to understand why this query crashes ES? Is there anything I can tweak just to get by for a couple of weeks?


Hi @ErisDS and welcome!

OOM errors are tricky to diagnose, and the most reliable way to get answers is to take the heap dump written as the process exits and look at it (e.g. in MAT) to see what's taking up so much heap. Or, with experience, to look at the search and spot why it's inefficient. More recent versions of Elasticsearch have better protection against OOMs.

Can you share the search you're trying to run here? Maybe there's a different way to get the same answers that isn't as expensive.

Is there a guide to getting my hands on the heap dump somewhere?

This is the search:

{
  "title": "Unique IP Report",
  "type": "table",
  "params": {
    "perPage": 25,
    "showMeticsAtAllLevels": false,
    "showPartialRows": false,
    "showTotal": false,
    "sort": {
      "columnIndex": null,
      "direction": null
    },
    "totalFunc": "sum",
    "type": "table"
  },
  "aggs": [
    {
      "id": "1",
      "enabled": true,
      "type": "cardinality",
      "schema": "metric",
      "params": {
        "field": "nginx.access.remote_ip"
      }
    },
    {
      "id": "2",
      "enabled": true,
      "type": "terms",
      "schema": "bucket",
      "params": {
        "field": "nginx.access.host",
        "size": 8000,
        "order": "desc",
        "orderBy": "1"
      }
    }
  ],
  "listeners": {}
}
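
For context, I believe the request body that Kibana actually sends to the _search endpoint for this visualisation is roughly the following. This is only a sketch on my part: the @timestamp field, the 30-day range and the readable aggregation names (Kibana really uses the numeric ids "1" and "2") are assumptions.

{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-30d",
        "lte": "now"
      }
    }
  },
  "aggs": {
    "hosts": {
      "terms": {
        "field": "nginx.access.host",
        "size": 8000,
        "order": { "unique_ips": "desc" }
      },
      "aggs": {
        "unique_ips": {
          "cardinality": {
            "field": "nginx.access.remote_ip"
          }
        }
      }
    }
  }
}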

I think it's probably pure volume of data... but a plain count metric works fine - it's just this unique count metric that doesn't.

I am also curious because, if I run the query successfully and then run the exact same query again (with the exact same date range), I'd expect some sort of caching to kick in, but it doesn't - the whole thing must get calculated again, because it can crash Elasticsearch on the 2nd or 3rd run.

The manual has some information about the location of the heap dumps.

Do other working searches involve such a large terms aggregation? Does this search still fail if you reduce the size? I don't know the details intimately, but I know that large terms aggregations are worth avoiding, so that's where I'd start investigating.
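
For example (just a sketch, and 500 is an arbitrary smaller value to test with), the terms bucket in the visualisation you posted would become:

{
  "id": "2",
  "enabled": true,
  "type": "terms",
  "schema": "bucket",
  "params": {
    "field": "nginx.access.host",
    "size": 500,
    "order": "desc",
    "orderBy": "1"
  }
}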

Yes, I've performed searches that are equally large - e.g. a plain count of remote_ips is fine; it's only the unique count that fails.

Is there a way to tune performance for the unique aggregation? I only need it on this one field.

I'm not really sure what you mean by "unique count" - it's a phrase you've used a few times but it's not something that I recognise from the Elasticsearch side.

Can you give examples of equivalently large searches that you are able to perform successfully?

Hopefully a screenshot is worth a thousand words:

[screenshot: the metric aggregation dropdown in the Kibana visualisation editor, showing Count and Unique Count]

Count works fine; Unique Count, which is the aggregation I need (and seems to be called cardinality in the underlying search), is the one that crashes ES.

The cardinality aggregation is notorious for causing memory issues when nested under a high-cardinality terms aggregation.
Each unique count on its own uses a modest amount of memory, but multiplied by a high number of parent terms buckets that adds up to a lot.
Fortunately you can tune the amount of memory used per count, at the cost of the accuracy of that count. This is the "precision_threshold" setting: up to that many values it keeps a set of term hashes and counts exactly, and beyond it it switches to a fuzzier, probabilistic way of counting unique values. The default threshold is 3000, but you can lower it to make big memory gains. In Kibana it looks like this:

[screenshot: adding a precision_threshold to the Unique Count metric in the Kibana visualisation editor]
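
In case the screenshot doesn't come through: you would put something like the following into the JSON Input box of the Unique Count metric (100 here is just an example value, not a recommendation - anything below the default of 3000 trades accuracy for memory):

{
  "precision_threshold": 100
}

As a very rough sense of scale, if I remember the docs correctly the collector needs on the order of precision_threshold * 8 bytes per bucket, so the default of 3000 costs roughly 24KB per host bucket, and with a terms size of 8000 that multiplies out to a couple of hundred megabytes for a single request, before you even account for per-shard overheads.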

(Note - in my test I was using Elastic Stack 7.2, which didn't go into meltdown: the circuit breaker kicked in with a memory warning and rejected a query similar to yours. Adding the precision_threshold avoided the error, but the point is that bad queries are handled better in newer versions of the stack.)
