[SOLVED] Weird cluster state after circuit_breaking_exception in 5.2.1


We use Filebeat and ingest nodes for log aggregation on a 3-node 5.2.1 cluster deployed on RHEL 7.3 servers. We recently experienced the following:

$ curl node0:9200?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "circuit_breaking_exception",
        "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [1491035750/1.3gb]",
        "bytes_wanted" : 1491058680,
        "bytes_limit" : 1491035750
      }
    ],
    "type" : "circuit_breaking_exception",
    "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [1491035750/1.3gb]",
    "bytes_wanted" : 1491058680,
    "bytes_limit" : 1491035750
  },
  "status" : 503
}

All nodes were sending this reply when queried with a GET on '/'.
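As a side note on the numbers in the message: in 5.x the parent breaker defaults to 70% of the JVM heap (indices.breaker.total.limit), so the 1.3gb limit implies a heap of roughly 2gb. A quick back-of-the-envelope check (the 70% default is the assumption here):

```shell
# Parent breaker limit taken from the error above.
limit=1491035750
# Assumption: 5.x default parent breaker limit is 70% of the JVM heap,
# so back out the implied heap size.
awk -v l="$limit" 'BEGIN { printf "implied heap: %.2f GiB\n", l / 0.70 / (1024 ^ 3) }'
```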

We are still investigating what sent such a large request. But the real problem was the weird cluster state afterwards. We still had the three nodes showing up in cluster health (and green):

$ curl -s http://node0:9200/_cluster/health?pretty
{
  "cluster_name" : "log_preprod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 6,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

However, node stats showed only one node:

$ curl -s http://node0:9200/_nodes/stats?pretty | jq '.nodes |keys' 
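For anyone hitting a similar discrepancy, two standard 5.x APIs give complementary views of node membership and make the mismatch easy to see side by side (node0 is just the example host from above):

```shell
# Cluster-state view: nodes as the master sees them.
curl -s http://node0:9200/_cat/nodes?v

# Per-node view: count only the nodes that actually answer the stats API.
curl -s http://node0:9200/_nodes/stats?pretty | jq '.nodes | keys | length'
```

In a healthy 3-node cluster both views should agree; here the second one disagreed with cluster health.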

We had to restart all the nodes in the cluster to get it working again.

Any clue what happened here?



Seems linked to "Request Circuit Breaker keeps tripping; how is the estimate calculated?"

This definitely has some similarities to the problem we were seeing in that thread. The recommendation there was to upgrade to 5.2.2.

We haven't had time to do that yet. In the interim, we are using (misusing?) a setting that lets us trick the circuit breaker, as described in that thread. We set it to a really tiny number (0.0001). I'm sure there are plenty of good reasons not to do that, but it is allowing our cluster to stay operational for longer periods while we work out a plan to upgrade to 5.2.2.
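The post doesn't name the setting explicitly; assuming it is the in-flight requests breaker overhead (network.breaker.inflight_requests.overhead, which exists in 5.x, defaults to 1, and is dynamically updatable), the workaround would look roughly like this:

```shell
# Sketch only: shrink the in-flight requests breaker overhead so its
# estimates stay far below the limit. The exact setting is an assumption;
# it is not named in the thread above.
curl -s -XPUT http://node0:9200/_cluster/settings -H 'Content-Type: application/json' -d '
{
  "transient" : {
    "network.breaker.inflight_requests.overhead" : 0.0001
  }
}'
```

Using "transient" means the override disappears on a full cluster restart, which fits a stopgap like this.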

We upgraded our cluster to 5.2.2 today. I will report back next week on the outcome (it takes 4-5 days for the cluster to go nuts).
Thanks for the heads up on the setting, I had overlooked that. It might come in handy if the problem does not go away.

I confirm that upgrading to 5.2.2 solves the problem in our case.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.