Request Circuit Breaker keeps tripping; how is the estimate calculated?

We have an ES cluster that keeps tripping the circuit breaker. Initially, the trips seemed tied to our field data, so as an experiment we migrated the index to a copy with field data turned off. Even so, we are still tripping a circuit breaker, now on transport_request.

The cluster has 11 data nodes, 1 index, 60 total shards (20 primaries, each with 2 replicas), and about 50 million documents.

  • Per-node memory: 10 GB allocated to the ES JVM heap (32 GB total RAM per machine)
  • Request circuit breaker limit: 80% (8 GB of the 10 GB heap)
  • Elasticsearch 5.1.1
  • Java 1.8.0_121

After a clean startup the cluster runs fine, but usually within a day we see the circuit breaker trip on transport_request.

[parent] Data too large, data for [<transport_request>] would be larger than limit of [8562042470/7.9gb]

This doesn't jibe with the size of our data or our queries/requests. It seems like ES is miscalculating the circuit-breaking condition.
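For what it's worth, the number in that error message is internally consistent: the breaker limit is just the configured fraction of the JVM heap. A quick arithmetic check (the heap size here is inferred from the message, not measured):

```python
# Sanity-check the limit reported in the breaker error message.
# Assumption: the node's actual heap is slightly under the 10 GB target,
# since 8562042470 bytes is not exactly 0.8 * 10 GiB.
LIMIT_BYTES = 8_562_042_470  # from "[8562042470/7.9gb]"
LIMIT_FRACTION = 0.80        # the configured request-breaker limit

implied_heap_gib = LIMIT_BYTES / LIMIT_FRACTION / 2**30
limit_gib = LIMIT_BYTES / 2**30
print(f"implied heap: {implied_heap_gib:.2f} GiB")  # ~9.97 GiB
print(f"limit: {limit_gib:.2f} GiB")                # ~7.97, shown truncated as "7.9gb"
```

So the limit itself is being computed correctly; the question is what is being counted against it.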

As an additional experiment, we tried tuning the overhead multiplier for the request circuit breaker. It defaults to 1, so we made it 90% smaller:

indices.breaker.request.overhead: 0.1

Again, the cluster runs fine for a time. But, the same circuit breaker eventually trips.
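That outcome fits a leak rather than a miscalculation. The overhead multiplier only scales each request's charge against the estimate (charge = bytes × overhead), so if charges are being accumulated and never released, an overhead of 0.1 makes the estimate climb ten times more slowly but it still climbs without bound. Illustrative arithmetic (the per-request leak size is a made-up number, not ES internals):

```python
# Illustrative: the overhead multiplier scales each request's charge,
# so un-released charges still accumulate -- just more slowly.
LIMIT_BYTES = 8_562_042_470
LEAKED_BYTES_PER_REQUEST = 1_000_000  # hypothetical leak, for illustration only

def requests_until_trip(overhead: float) -> int:
    """Requests until leaked charges alone exceed the breaker limit."""
    return int(LIMIT_BYTES / (LEAKED_BYTES_PER_REQUEST * overhead))

print(requests_until_trip(1.0))  # 8562
print(requests_until_trip(0.1))  # 85620 -- ten times longer, same outcome
```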

This feels like a bug in this particular circuit breaker's calculation. We should not be coming anywhere near this limit with any of our requests.

How is the estimate calculated? If we are seeing a bug in the calculation, as inadvisable as it may be, is there a way to turn off just the request breaker (we're pretty confident we won't exceed the heap)? Any other ideas on things we can try?
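(For reference on the "turn it off" question: I don't believe the request breaker can be disabled outright, but its limit is a dynamic setting in 5.x, so it can be raised high enough to effectively neutralize it, at the real risk of OOMing the node. Assuming the 5.x setting name:)

```yaml
# Assumption: 5.x dynamic setting name; raising this effectively
# neutralizes the request breaker, trading breaker trips for possible OOMs.
indices.breaker.request.limit: 95%
```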

I suspect you're running into this bug: https://github.com/elastic/elasticsearch/pull/23310, which was fixed in just-released 5.2.2: https://www.elastic.co/blog/elasticsearch-5-2-2-released

The request circuit breaker, which tracks the size of in-flight requests, was not decrementing its counter when the connection was closed by the client before the response could be returned. This could result in no further requests being accepted until the node was restarted. All users should upgrade to take advantage of this bug fix.
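To make that failure mode concrete, here is a toy model of the accounting bug (illustrative Python, not the actual Elasticsearch code; names and numbers are made up): the breaker charges each in-flight request against a running estimate and is supposed to release the charge when the response goes out, so a skipped release on client disconnect makes the estimate ratchet upward until ordinary requests trip it:

```python
# Toy model of the in-flight request breaker accounting (not real ES code).
class RequestBreaker:
    def __init__(self, limit_bytes: int, overhead: float = 1.0):
        self.limit = limit_bytes
        self.overhead = overhead
        self.estimate = 0.0  # running total of charged request bytes

    def add(self, request_bytes: int) -> None:
        charge = request_bytes * self.overhead
        if self.estimate + charge > self.limit:
            raise MemoryError(
                f"Data too large: would be larger than limit of [{self.limit}]")
        self.estimate += charge

    def release(self, request_bytes: int) -> None:
        self.estimate -= request_bytes * self.overhead

breaker = RequestBreaker(limit_bytes=1_000)  # tiny limit for the demo

# Healthy path: the charge is released when the response is sent.
breaker.add(100)
breaker.release(100)
assert breaker.estimate == 0.0

# Buggy path: the client disconnects early, release() is never called,
# so the estimate only ever grows, no matter how modest each request is.
for _ in range(9):
    breaker.add(100)
print(breaker.estimate)  # 900.0, and climbing with every abandoned request
```

This also matches the observed pattern of a clean restart (estimate reset to zero) buying roughly a day before trips resume.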

Thanks for the advice! We're anxious to see whether 5.2.2 fixes this. We'll need some time before we can upgrade our cluster, but I'll be sure to report our results here.

Out of curiosity, do you know in what version this bug was introduced? We've been trying to pinpoint when we started seeing it.

The bug exists in all versions prior to 5.2.2 (5.0.0, 5.0.1, 5.0.2, 5.1.1, 5.1.2, 5.2.0, and 5.2.1). If you can upgrade, would you please report back, one way or the other, whether it fixes your issue?

Are you also collecting stats on your cluster? There is a separate serialization bug that affects your version, and I have a theory that the first bug can trigger the second. To be clear, I have seen cases of the circuit breaker bug that were not caused by the serialization bug; I'm only looking for support for my theory.
