We have an ES cluster that is having issues with the circuit breaker. Initially, it seemed tied to our field data. As an experiment, we migrated our index over to a copy with field data turned off. Now, we are still tripping a circuit breaker on transport_request.
The cluster has 11 data nodes, a single index with 60 total shards (20 primary shards, 2 replicas each), and about 50 million documents.
- Memory per node: 10 GB allocated to the ES JVM heap (32 GB total RAM per machine)
- Request circuit breaker limit: 80% (8 GB of the 10 GB heap)
- Elasticsearch 5.1.1
- Java 1.8.0_121
On a clean startup of the cluster, things run fine, but usually within a day we see the circuit breaker trip on transport_request:

```
[parent] Data too large, data for [<transport_request>] would be larger than limit of [8562042470/7.9gb]
```
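For reference, the live limits, estimates, and trip counts for each breaker can be read from the node stats API; a minimal check (assuming a node reachable on localhost:9200) looks like:

```sh
# Per-node circuit breaker stats: limit, estimated size, overhead, and trip count
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'
```

Watching the `estimated_size_in_bytes` field for the parent and request breakers over time is how we've been tracking the estimate as it climbs toward the limit.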
This doesn't jibe with the size of our data or our queries/requests. It seems like ES is miscalculating the circuit-breaking condition.
As an additional experiment, we tried tuning the overhead multiplier for the request circuit breaker. It defaults to 1, so we cut it by 90% in elasticsearch.yml:

```yaml
indices.breaker.request.overhead: 0.1
```
Again, the cluster runs fine for a time. But, the same circuit breaker eventually trips.
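In case it saves anyone else a restart: the breaker settings are dynamic, so the same experiment can be applied at runtime through the cluster settings API (values here mirror what we set in the config file):

```sh
# Apply the same overhead multiplier dynamically, without restarting nodes.
# "transient" settings are lost on a full cluster restart; use "persistent" to keep them.
curl -s -XPUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.breaker.request.overhead": 0.1}}'
```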
This feels like a bug in this particular circuit breaker's calculation. None of our requests should come anywhere near this limit.
How is the estimate calculated? If we are seeing a bug in the calculation, inadvisable as it may be, is there a way to turn off just the request breaker (we're fairly confident we won't exceed the heap)? Any other ideas on things we can try?
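For concreteness, since we haven't found a documented "off" switch, what we were considering is effectively disabling the request breaker by raising its limit to the full heap (we assume a 100% limit means it would only trip at heap exhaustion, which is the behavior we'd get anyway):

```sh
# Hypothetical workaround, not something we've verified: raise the request
# breaker limit to 100% of the heap so it should essentially never trip on its own.
curl -s -XPUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.breaker.request.limit": "100%"}}'
```

We'd rather understand the miscalculation than paper over it this way, which is why we're asking first.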