Circuit breaker with permanent durability

In our Elasticsearch 7.5 cluster, a large query I ran tripped circuit breakers on the data nodes.

    {
      "took": 8097,
      "responses": [{
        "error": {
          "root_cause": [{
            "type": "array_index_out_of_bounds_exception",
            "reason": "Index 33554431 out of bounds for length 528365"
          }, {
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<transport_request>] would be [30625145952/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30625144224/28.5gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4059422914/3.7gb, in_flight_requests=1728/1.6kb, accounting=285788997/272.5mb]",
            "bytes_wanted": 30625145952,
            "bytes_limit": 30601641984,
            "durability": "PERMANENT"
          }, {
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<transport_request>] would be [30676025248/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30676023520/28.5gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4078935366/3.7gb, in_flight_requests=1728/1.6kb, accounting=286485025/273.2mb]",
            "bytes_wanted": 30676025248,
            "bytes_limit": 30601641984,
            "durability": "PERMANENT"
          }, {
            "type": "array_index_out_of_bounds_exception",
            "reason": "Index 33554431 out of bounds for length 13236"
          }, {
            "type": "array_index_out_of_bounds_exception",
            "reason": "Index 33554431 out of bounds for length 577591"
          }, {
            "type": "array_index_out_of_bounds_exception",
            "reason": "Index 33554431 out of bounds for length 339248"
          }, {
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<transport_request>] would be [30723147968/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30723146240/28.6gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4081225097/3.8gb, in_flight_requests=34608/33.7kb, accounting=287943225/274.6mb]",
            "bytes_wanted": 30723147968,
            "bytes_limit": 30601641984,
            "durability": "PERMANENT"
          }, {
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<transport_request>] would be [30723147968/28.6gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30723146240/28.6gb], new bytes reserved: [1728/1.6kb], usages [request=0/0b, fielddata=4081225097/3.8gb, in_flight_requests=1728/1.6kb, accounting=287943225/274.6mb]",
            "bytes_wanted": 30723147968,
            "bytes_limit": 30601641984,
            "durability": "PERMANENT"
          }
    ...

I've opened issues about the 7.5 circuit breakers before; at this point I'm not so much interested in what caused the breaker to trip. Instead, I can't figure out how to get the breakers to clear. I've cleared caches and given the cluster time to recover, but the breakers stay tripped and I can't make any queries.
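For reference, these are the cache-clearing calls I mean (the host is a placeholder for one of our nodes); the fielddata flag is there because fielddata shows up in the breaker's usage breakdown:

    # Clear all index-level caches across the cluster
    curl -X POST "localhost:9200/_cache/clear?pretty"

    # Clear only the fielddata cache, which the breaker output above shows at ~3.7gb
    curl -X POST "localhost:9200/_cache/clear?fielddata=true&pretty"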

Is this because the durability is "PERMANENT"? I have had a very hard time finding any documentation on what that means. Does a permanent circuit breaker mean I have to restart the node to recover from it?
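One experiment I've considered but not tried, since the circuit breaker settings are dynamic: temporarily raising the parent breaker limit to see whether requests start flowing again. Not a fix, and the 98% value below is just an illustration:

    curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
    {
      "transient": {
        "indices.breaker.total.limit": "98%"
      }
    }'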

All of our monitoring tells me that these machines are not showing signs of elevated memory usage. They should be healthy, but this circuit breaker appears to be preventing them from responding.
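For what it's worth, here's the kind of check I've been cross-referencing against our monitoring, asking Elasticsearch itself for heap usage rather than trusting an external agent:

    # Per-node heap usage as a percentage, from the nodes stats API
    curl -s "localhost:9200/_nodes/stats/jvm?pretty&filter_path=nodes.*.jvm.mem.heap_used_percent"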

And now I am seeing this:

    {
      "_nodes" : {
        "total" : 85,
        "successful" : 25,
        "failed" : 60,
        "failures" : [
          {
            "type" : "failed_node_exception",
            "reason" : "Failed node [vWWnWj-tTNywJndl2HjfYw]",
            "node_id" : "vWWnWj-tTNywJndl2HjfYw",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[parent] Data too large, data for [<transport_request>] would be [31732088564/29.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [31732063792/29.5gb], new bytes reserved: [24772/24.1kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=24772/24.1kb, accounting=286083803/272.8mb]",
              "bytes_wanted" : 31732088564,
              "bytes_limit" : 30601641984,
              "durability" : "PERMANENT"
            }
          },
          {
            "type" : "failed_node_exception",
            "reason" : "Failed node [A5OKzxYtQmWYLG_C7ysMwA]",
            "node_id" : "A5OKzxYtQmWYLG_C7ysMwA",
            "caused_by" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[parent] Data too large, data for [<transport_request>] would be [31128610916/28.9gb], which is larger than the limit of [30601641984/28.5gb], real usage: [31128586144/28.9gb], new bytes reserved: [24772/24.1kb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=24772/24.1kb, accounting=284626965/271.4mb]",
              "bytes_wanted" : 31128610916,
              "bytes_limit" : 30601641984,
              "durability" : "PERMANENT"
            }
          }

Everything is acting very strangely. The cluster reports green, but when I list the indices, they all show up like this:

    health status index        uuid                   pri rep docs.count docs.deleted store.size pri.store.size
    green  open   index1      3Q6XluqUSoO3xJCmWvIujA 180   1
    green  open   index2      idM7VJHzR1WdAIlygfZvuA   1   1
    green  open   index3      lTHtn_tBQLuX90zVU6hTjQ 180   1
The primary and replica counts are correct, but there is no data at all for docs.count, docs.deleted, store.size, or pri.store.size.
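That listing looks like _cat/indices output; for reference, this is the call with those columns requested explicitly, in case the defaults were hiding something (they weren't):

    curl -s "localhost:9200/_cat/indices?v&h=health,status,index,uuid,pri,rep,docs.count,docs.deleted,store.size,pri.store.size"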

The state it is in now is one I do not understand at all.

A call to _nodes/stats says all of the data nodes have failed because of the circuit breaker, yet resource utilization is low. I've cleared all the caches and even deleted half of the data, to no effect.
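The tripped state I'm describing comes from the breaker section of the node stats, which reports each breaker's limit, estimated size, and tripped count:

    # Circuit breaker stats per node
    curl -s "localhost:9200/_nodes/stats/breaker?pretty"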

Despite that, normal searches against the cluster's indices actually work fine, and quickly. But Kibana crashes on startup with errors that mention the failed nodes.

The nodes are stuck in this state, and I do not understand why, or even what this state is. Certain APIs say the cluster is failing, while other calls make it look like everything is fine: everything is green, and searches are working.

I am assuming that rebooting all of our data nodes will get us out of this, but I would very much like to understand what is going on. So far in 7.5, the new circuit breaker logic seems to keep putting us in situations that are worse than anything we hit in 6.4.
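If a restart really is the only way out, the plan would be a standard rolling restart of the data nodes, roughly this per node (with allocation paused while each one is down):

    # Before stopping a node: stop replica reallocation
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    { "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

    # ... restart the node and wait for it to rejoin ...

    # Re-enable allocation (null resets to the default, "all") and wait for green
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    { "persistent": { "cluster.routing.allocation.enable": null } }'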
