Finding out cause of circuit breaking exception (Data too large)

Hi,

I'm getting the error below occasionally, and I'm trying to find the cause of it so I can try to resolve it.

circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [32620387928/30.3gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32620387928/30.3gb], new bytes reserved: [0/0b], usages [eql_sequence=0/0b, fielddata=24272253421/22.6gb, request=0.0b, inflight_requests=0/0b, model_inference=0.0b]

I'm running ES v8.8.0, with multiple instances of Filebeat writing to a single index at an indexing rate of about 160K/s. The ES cluster is on Kubernetes, with 3 master ES nodes and 40 data nodes (1.3TB of memory, usually at 50-60% usage, and 126TB of 600TB SSD space free). Once I hit the error above, I can no longer access Kibana/ES and have to restart the ES nodes.

One strange thing I happened to notice was that when I got the error, the indexing rate of the index currently being written to shot up to an impossibly high, 9-digit number (on the Stack Monitoring --> Indices page, time window: last 2 min). It then dropped to 2M+, which is still unusually high since I'm expecting at most ~160K/s. I'm not sure if this has anything to do with the error.

Is there a way to find out what is causing this issue, such as a complex query running, the ingest rate being too high (I have hit this error at a lower indexing rate of ~100K/s too), etc.?

So far, I've noticed that the real usage is not that much higher than the limit. Would increasing the limit from 29.4gb to, say, 32gb be possible and feasible?
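If raising the limit is an option, I believe the parent breaker is controlled by the indices.breaker.total.limit cluster setting, and that it accepts either an absolute size or a percentage of heap. I haven't tried changing it yet, so please correct me if this isn't the right knob; the 32gb value below is just the figure from my question:

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "32gb"
  }
}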

Thank you.

Hi @hjazz6,

Thanks for sharing. Are you using fielddata on text fields at all as per the documentation? It might be worth checking the state of the breakers using the stats API.
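Something along these lines should show the per-node breaker state and any per-field fielddata usage (adjust to your cluster as needed):

GET _nodes/stats/breaker

GET _cat/fielddata?v=true&fields=*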

Feel free to share the output and your cluster settings and we can see if we can help further.

Hi @carly.richmond ,

Thanks for replying.

Out of the 102 fields I have, 9 are "text" fields.

I'm not sure how to interpret the output of GET _nodes/stats/breaker, and I can't copy and paste the entire output as my ES is on another (offline) machine. Perhaps you can let me know which fields I should be looking at?

Here is a portion of the output (of one of the nodes):

"attributes": {
    ...
    "ml.allocated_processors_double": "12.0",
    "data": "warm",
    "ml.allocated_processors": "12",
    "ml.machine_memory": "68719476736",
    "ml.max_jvm_size": "33285996544"
},
"breakers": {
    "eql_sequence": {
        "limit_size_in_bytes": 16642998272,
        "limit_size": "15.5gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1,
        "tripped": 0
    },
    "fielddata": {
        "limit_size_in_bytes": 13314398617,
        "limit_size": "12.3gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1.03,
        "tripped": 0
    },
    "request": {
        "limit_size_in_bytes": 19971597926,
        "limit_size": "18.5gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1,
        "tripped": 0
    },
    "inflight_requests": {
        "limit_size_in_bytes": 33285996544,
        "limit_size": "31gb",
        "estimated_size_in_bytes": 7412747,
        "estimated_size": "7mb",
        "overhead": 2,
        "tripped": 0
    },
    "model_inference": {
        "limit_size_in_bytes": 16642998272,
        "limit_size": "15.5gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1,
        "tripped": 0
    },
    "parent": {
        "limit_size_in_bytes": 31621696716,
        "limit_size": "29.4gb",
        "estimated_size_in_bytes": 16919285712,
        "estimated_size": "15.7gb",
        "overhead": 1,
        "tripped": 0
    }
}

All tripped values of all nodes are 0.

One observation I made was that when the circuit breaking exception happened again yesterday, the JVM heap memory usage was at least 85%.
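In case it's useful to anyone else hitting this, heap usage per node can also be checked with something like the following (the 85% figure corresponds to heap.percent here):

GET _cat/nodes?v=true&h=name,heap.percent,heap.current,heap.max,ram.percent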

Hi @hjazz6,

You would be looking at the tripped attribute for each breaker, which gives the number of times it has been triggered. If you check the breakers property further down in the documentation, there is a description of each field.
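For example, something like this (adjust the filter_path to taste) narrows the stats output down to just the parent breaker on each node, which is the one tripping in your error:

GET _nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent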

[<http_request>] would be [32620387928/30.3gb], which is larger than the limit of [31621696716/29.4gb]

The circuit breaking exception is an indication of JVM pressure, so it does make sense that you observed high heap usage. Changing the limits isn't really recommended, as they are designed to prevent OOM errors.

Do you know what kind of request was being processed when you saw the circuit breaker trigger yesterday? It might be worth checking the Elasticsearch logs, and potentially the slow log, to see what activity was happening around that time. Then you can see whether you can split the request.
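If the slow logs aren't already enabled, something along these lines should turn them on for a given index (the index name and thresholds here are just placeholders to adjust):

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}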

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.