Request Circuit Breaker keeps tripping; how is the estimate calculated?

We have an ES cluster that keeps tripping the circuit breaker. Initially, the trips seemed tied to our field data, so as an experiment we migrated the index to a copy with field data turned off. Even so, we are still tripping a circuit breaker, now on transport_request.

The cluster has 11 data nodes, 1 index, 60 total shards (20 primaries, each with 2 replicas), and about 50 million documents.

  • Per-node memory: 10 GB allocated to the ES JVM heap (32 GB total RAM per machine)
  • Request circuit breaker limit: 80% (8 GB of the 10 GB heap)
  • Elasticsearch 5.1.1
  • Java 1.8.0_121

After a clean startup the cluster runs fine, but usually within a day we see the circuit breaker trip on transport_request.

[parent] Data too large, data for [<transport_request>] would be larger than limit of [8562042470/7.9gb]

This doesn't jibe with the size of our data or our queries/requests. It seems like ES is miscalculating the circuit-breaking condition.
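For what it's worth, the number in that error message is internally consistent: the breaker limit is just the configured fraction of the JVM heap. A quick arithmetic check (the heap size here is inferred from the message, not measured):

```python
# Sanity-check the limit reported in the breaker error message.
# Assumption: the node's actual heap is slightly under the 10 GB target,
# since 8562042470 bytes is not exactly 0.8 * 10 GiB.
LIMIT_BYTES = 8_562_042_470  # from "[8562042470/7.9gb]"
LIMIT_FRACTION = 0.80        # the configured request-breaker limit

implied_heap_gib = LIMIT_BYTES / LIMIT_FRACTION / 2**30
limit_gib = LIMIT_BYTES / 2**30
print(f"implied heap: {implied_heap_gib:.2f} GiB")  # ~9.97 GiB
print(f"limit: {limit_gib:.2f} GiB")                # ~7.97, shown truncated as "7.9gb"
```

So the limit itself is being computed correctly; the question is what is being counted against it.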

As an additional experiment, we tried tuning the overhead multiplier for the request circuit breaker. It defaults to 1, so we made it 90% smaller:

indices.breaker.request.overhead: 0.1

Again, the cluster runs fine for a time. But, the same circuit breaker eventually trips.
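That outcome fits a leak rather than a miscalculation. The overhead multiplier only scales each request's charge against the estimate (charge = bytes × overhead), so if charges are being accumulated and never released, an overhead of 0.1 makes the estimate climb ten times more slowly but it still climbs without bound. Illustrative arithmetic (the per-request leak size is a made-up number, not ES internals):

```python
# Illustrative: the overhead multiplier scales each request's charge,
# so un-released charges still accumulate -- just more slowly.
LIMIT_BYTES = 8_562_042_470
LEAKED_BYTES_PER_REQUEST = 1_000_000  # hypothetical leak, for illustration only

def requests_until_trip(overhead: float) -> int:
    """Requests until leaked charges alone exceed the breaker limit."""
    return int(LIMIT_BYTES / (LEAKED_BYTES_PER_REQUEST * overhead))

print(requests_until_trip(1.0))  # 8562
print(requests_until_trip(0.1))  # 85620 -- ten times longer, same outcome
```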

This feels like a bug in this particular circuit breaker's calculation. We should not be coming anywhere near this limit with any of our requests.

How is the estimate calculated? If we are seeing a bug in the calculation, as inadvisable as it may be, is there a way to turn off just the request breaker (we're pretty confident we won't exceed the heap)? Any other ideas on things we can try?
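(For reference on the "turn it off" question: I don't believe the request breaker can be disabled outright, but its limit is a dynamic setting in 5.x, so it can be raised high enough to effectively neutralize it, at the real risk of OOMing the node. Assuming the 5.x setting name:)

```yaml
# Assumption: 5.x dynamic setting name; raising this effectively
# neutralizes the request breaker, trading breaker trips for possible OOMs.
indices.breaker.request.limit: 95%
```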

I suspect you're running into this bug: https://github.com/elastic/elasticsearch/pull/23310, which was fixed in just-released 5.2.2: https://www.elastic.co/blog/elasticsearch-5-2-2-released

The request circuit breaker, which tracks the size of in-flight requests, was not decrementing its counter when the connection was closed by the client before the response could be returned. This could result in no further requests being accepted until the node was restarted. All users should upgrade to take advantage of this bug fix.
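To make that failure mode concrete, here is a toy model of the accounting bug (illustrative Python, not the actual Elasticsearch code; names and numbers are made up): the breaker charges each in-flight request against a running estimate and is supposed to release the charge when the response goes out, so a skipped release on client disconnect makes the estimate ratchet upward until ordinary requests trip it:

```python
# Toy model of the in-flight request breaker accounting (not real ES code).
class RequestBreaker:
    def __init__(self, limit_bytes: int, overhead: float = 1.0):
        self.limit = limit_bytes
        self.overhead = overhead
        self.estimate = 0.0  # running total of charged request bytes

    def add(self, request_bytes: int) -> None:
        charge = request_bytes * self.overhead
        if self.estimate + charge > self.limit:
            raise MemoryError(
                f"Data too large: would be larger than limit of [{self.limit}]")
        self.estimate += charge

    def release(self, request_bytes: int) -> None:
        self.estimate -= request_bytes * self.overhead

breaker = RequestBreaker(limit_bytes=1_000)  # tiny limit for the demo

# Healthy path: the charge is released when the response is sent.
breaker.add(100)
breaker.release(100)
assert breaker.estimate == 0.0

# Buggy path: the client disconnects early, release() is never called,
# so the estimate only ever grows, no matter how modest each request is.
for _ in range(9):
    breaker.add(100)
print(breaker.estimate)  # 900.0, and climbing with every abandoned request
```

This also matches the observed pattern of a clean restart (estimate reset to zero) buying roughly a day before trips resume.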

Thanks for the advice! We're anxious to see whether 5.2.2 fixes this. We'll need some time before we can upgrade our cluster, but I'll be sure to report our results here.

Out of curiosity, do you know in what version this bug was introduced? We've been trying to pinpoint when we started seeing it.

The bug exists in all versions prior to 5.2.2 (5.0.0, 5.0.1, 5.0.2, 5.1.1, 5.1.2, 5.2.0, and 5.2.1). If you can upgrade, would you please report back, one way or the other, whether it fixes your issue?

Are you also collecting stats on your cluster? There is a separate serialization bug that affects your version, and I have a theory that the first bug can trigger the second. To be clear, I have seen cases of the circuit breaker bug that were not caused by the serialization bug; I'm only looking for support for my theory.
