We've started getting frequent circuit breaker exceptions in our cluster, and I need some advice on how to tackle the situation. I've read up on circuit breakers and gone through lots of posts on the topic, but I'm still not sure what we need to do.
We're running a 9-node Elasticsearch 7.3.1 cluster; each node has 64 GB of OS-level RAM and a 30 GB heap. All nodes run as Docker containers with no custom GC settings.
Here's an example error message from a basic reindex operation:
elasticsearch.exceptions.TransportError: TransportError(429, '{"took":1025273,"timed_out":false,"total":482123,"updated":469000,"created":0,"deleted":0,"batches":469,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[{"shard":-1,"reason":{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<transport_request>] would be [30609748696/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30609747960/28.5gb], new bytes reserved: [736/736b], usages [request=0/0b, fielddata=35860/35kb, in_flight_requests=736/736b, accounting=524016769/499.7mb]","bytes_wanted":30609748696,"bytes_limit":30601641984,"durability":"PERMANENT"}}]}')
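For context, the reindex itself is nothing exotic. A minimal sketch of the kind of call that triggers this, using the official Python client (hosts and index names are hypothetical; the real job is a plain source-to-dest reindex with the default batch size of 1,000 docs, which matches the 469 batches / 469,000 docs in the error above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node-1:9200"])  # hypothetical host

es.reindex(
    body={
        "source": {"index": "events-v1"},  # hypothetical index names
        "dest": {"index": "events-v2"},
    },
    wait_for_completion=True,  # the 429 surfaces here as TransportError
    request_timeout=3600,
)
```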
I've started monitoring the breakers, and our problem seems to be with the parent breaker: individual nodes regularly climb right up to the parent limit and trip it.
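In case it's useful, this is roughly how I'm polling the breaker stats, via the nodes stats API (a minimal sketch with the Python client; the node address is hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node-1:9200"])  # hypothetical host

# _nodes/stats/breaker reports estimated usage vs. limit per breaker.
stats = es.nodes.stats(metric="breaker")
for node_id, node in stats["nodes"].items():
    parent = node["breakers"]["parent"]
    print(
        node["name"],
        f'{parent["estimated_size_in_bytes"]}/{parent["limit_size_in_bytes"]}',
        f'tripped={parent["tripped"]}',
    )
```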
Some questions:
- Would our cluster benefit from reducing the max heap size on each node? All nodes currently use zero-based compressed oops (see the sketch after this list for how we check that).
- Should we add more nodes to the cluster?
- Should we use a more aggressive GC?
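On the heap question: here's roughly how we confirm the compressed-oops status, via the nodes info API (again a minimal sketch with the Python client; the host is hypothetical). Note the API only reports whether compressed oops are in use; confirming they are zero-based specifically requires checking the JVM startup logs.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node-1:9200"])  # hypothetical host

# The JVM section of _nodes info shows each node's max heap and
# whether it is small enough for compressed ordinary object pointers.
info = es.nodes.info(metric="jvm")
for node_id, node in info["nodes"].items():
    jvm = node["jvm"]
    print(
        node["name"],
        f'heap_max_bytes={jvm["mem"]["heap_max_in_bytes"]}',
        f'compressed_oops={jvm["using_compressed_ordinary_object_pointers"]}',
    )
```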
Any tips on how to fix this would be very helpful.