In our ES 7.5 cluster, a large query I ran tripped some circuit breakers on the data nodes:
{
"took": 8097,
"responses": [{
"error": {
"root_cause": [{
"type": "array_index_out_of_bounds_exception",
"reason": "Index 33554431 out of bounds for length 528365"
}, {
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<transport_request>] would be [30625145952/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30625144224/28.5gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4059422914/3.7gb, in_flight_requests=1728/1.6kb, accounting=285788997/272.5mb]",
"bytes_wanted": 30625145952,
"bytes_limit": 30601641984,
"durability": "PERMANENT"
}, {
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<transport_request>] would be [30676025248/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30676023520/28.5gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4078935366/3.7gb, in_flight_requests=1728/1.6kb, accounting=286485025/273.2mb]",
"bytes_wanted": 30676025248,
"bytes_limit": 30601641984,
"durability": "PERMANENT"
}, {
"type": "array_index_out_of_bounds_exception",
"reason": "Index 33554431 out of bounds for length 13236"
}, {
"type": "array_index_out_of_bounds_exception",
"reason": "Index 33554431 out of bounds for length 577591"
}, {
"type": "array_index_out_of_bounds_exception",
"reason": "Index 33554431 out of bounds for length 339248"
}, {
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<transport_request>] would be [30723147968/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30723146240/28.6gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4081225097/3.8gb, in_flight_requests=34608/33.7kb, accounting=287943225/274.6mb]",
"bytes_wanted": 30723147968,
"bytes_limit": 30601641984,
"durability": "PERMANENT"
}, {
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<transport_request>] would be [30723147968/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30723146240/28.6gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4081225097/3.8gb, in_flight_requests=1728/1.6kb, accounting=287943225/274.6mb]",
"bytes_wanted": 30723147968,
"bytes_limit": 30601641984,
"durability": "PERMANENT"
}
...
I've opened issues about 7.5 circuit breakers before, so at this point I'm less interested in what caused the breaker to trip. What I can't figure out is how to get the breakers to clear. I've cleared caches and given the cluster time to recover, but the breakers stay tripped and I can't run any queries.
Is this a consequence of the durability being "PERMANENT"? I have had a very hard time finding any documentation on what that means. Does a permanent circuit breaker mean I have to restart the node to recover from it?
All of our monitoring tells me that these machines are not showing signs of elevated memory usage. They should be healthy, but this circuit breaker appears to be preventing them from responding.
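To make the numbers in the error easier to compare, here is a small sketch that pulls the byte counts out of one of the `reason` strings above and compares the parent breaker's reported real usage against the sum of the child breaker usages it lists. The function name `parse_breaker_reason` and the regular expressions are my own, written against the message format shown in this post; they are not part of any Elasticsearch tooling and may not match other versions' messages.

```python
import re


def parse_breaker_reason(reason):
    """Extract byte counts from a [parent] circuit breaker 'reason' string.

    Hypothetical helper: assumes the message format shown in this post.
    """
    limit = int(re.search(r"limit of \[(\d+)/", reason).group(1))
    real_usage = int(re.search(r"real usage: \[(\d+)/", reason).group(1))
    # The trailing "usages [...]" section lists each child breaker as name=bytes/human.
    usages_text = re.search(r"usages \[(.*)\]", reason).group(1)
    children = {
        name: int(value)
        for name, value in re.findall(r"(\w+)=(\d+)/", usages_text)
    }
    tracked_total = sum(children.values())
    return {
        "limit": limit,
        "real_usage": real_usage,
        "children": children,
        "tracked_total": tracked_total,
        # Bytes counted in real usage but not attributed to any child breaker.
        "untracked": real_usage - tracked_total,
    }


# First circuit_breaking_exception reason from the response above.
reason = (
    "[parent] Data too large, data for [<transport_request>] would be "
    "[30625145952/28.5gb], which is larger than the limit of "
    "[30601641984/28.5gb], real usage: [30625144224/28.5gb], "
    "new bytes reserved: [1728/1.6kb], usages [request=0/0b, "
    "fielddata=4059422914/3.7gb, in_flight_requests=1728/1.6kb, "
    "accounting=285788997/272.5mb]"
)

stats = parse_breaker_reason(reason)
for key, value in stats.items():
    print(key, value)
```

For the first exception, the child breakers account for only about 4 GB of the roughly 28.5 GB reported as real usage, so most of what the parent breaker is counting is not attributed to fielddata, request, in-flight, or accounting data.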