[SOLVED] Weird cluster state after circuit_breaking_exception in 5.2.1

leucos · March 8, 2017, 8:27am

Hi,

We use Filebeat and ingest nodes for log aggregation on a three-node 5.2.1 cluster deployed on RH 7.3 servers. We recently ran into the following:

$ curl node0:9200?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "circuit_breaking_exception",
        "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [1491035750/1.3gb]",
        "bytes_wanted" : 1491058680,
        "bytes_limit" : 1491035750
      }
    ],
    "type" : "circuit_breaking_exception",
    "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [1491035750/1.3gb]",
    "bytes_wanted" : 1491058680,
    "bytes_limit" : 1491035750
  },
  "status" : 503
}

All nodes were sending this reply when queried with a GET on '/'.
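A quick way to read the error response above is to compare its two byte counts. The sketch below uses only the `bytes_wanted` and `bytes_limit` values from that response; the helper name `breaker_overshoot` is made up for illustration.

```shell
# Values copied from the circuit_breaking_exception response above.
bytes_wanted=1491058680
bytes_limit=1491035750

# breaker_overshoot WANTED LIMIT
# Print how many bytes the request would have pushed past the limit.
# (Hypothetical helper, not part of Elasticsearch.)
breaker_overshoot() {
  echo $(( $1 - $2 ))
}

breaker_overshoot "$bytes_wanted" "$bytes_limit"   # prints 22930
```

An overshoot of roughly 22 KB on a 1.3gb limit may indicate that the breaker's accounting was already close to the limit before the request arrived, rather than that one request was very large. That is only one possible reading of the numbers.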

We are still investigating what sent such a large request. But the real problem was the weird cluster state afterwards. We still had the three nodes showing up in cluster health (and green):

$ curl -s http://node0:9200/_cluster/health?pretty
{
  "cluster_name" : "log_preprod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 6,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

However, node stats showed only one node:

$ curl -s http://node0:9200/_nodes/stats?pretty | jq '.nodes | keys'
[
  "9aVJKZEFQcKL2tWFbUDYsg"
]
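The mismatch between the two responses above (three nodes in cluster health, one in node stats) could be checked automatically. This is a minimal sketch: `nodes_consistent` is a hypothetical helper, and the commented-out commands assume `curl` and `jq` are installed and that `node0` is reachable, as in the commands above.

```shell
# nodes_consistent HEALTH_COUNT STATS_COUNT
# Compare the node count from _cluster/health with the number of
# entries returned by _nodes/stats. (Hypothetical helper.)
nodes_consistent() {
  if [ "$1" -eq "$2" ]; then
    echo "consistent"
  else
    echo "mismatch: health reports $1 nodes, stats returned $2"
    return 1
  fi
}

# Against a live cluster (not run here):
# health=$(curl -s 'http://node0:9200/_cluster/health' | jq '.number_of_nodes')
# stats=$(curl -s 'http://node0:9200/_nodes/stats' | jq '.nodes | length')
# nodes_consistent "$health" "$stats"
```

With the values reported in this post, `nodes_consistent 3 1` would print the mismatch message and return a non-zero status.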

We had to restart all nodes in the cluster to get it working again.

Any clue what happened here?

Thanks

M

leucos · March 8, 2017, 4:37pm

Seems linked to the thread "Request Circuit Breaker keeps tripping; how is the estimate calculated?"

spltscreen · March 15, 2017, 5:38pm

This definitely has some similarities to the problem we were seeing in that thread. The recommendation there was to upgrade to 5.2.2.

We haven't had time to do that yet. In the interim, we are using (misusing?) a setting that lets us trick the circuit breaker, as described in that thread. We set it to a really tiny number (0.0001). There are probably a lot of good reasons not to do that, but it is allowing our cluster to stay operational for longer periods while we work out a plan to upgrade to 5.2.2.

leucos · March 15, 2017, 6:02pm

We upgraded our cluster to 5.2.2 today. I will report back next week on the outcome (it takes 4-5 days for the cluster to go nuts).
Thanks for the heads up on the setting, I overlooked that. Might come in handy if the problem does not go away.
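Since the problem takes days to show up, it may help to watch how full the breaker is in the meantime. The sketch below is an assumption-laden illustration: `breaker_pct` is a hypothetical helper, and the commented-out commands assume the `_nodes/stats/breaker` response exposes `estimated_size_in_bytes` and `limit_size_in_bytes` under `breakers.parent`, which should be verified against the running version.

```shell
# breaker_pct ESTIMATED LIMIT
# Print breaker usage as a whole-number percentage of its limit
# (integer division, so the result is rounded down).
# (Hypothetical helper, not part of Elasticsearch.)
breaker_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Against a live cluster (not run here; field names are assumptions):
# curl -s 'http://node0:9200/_nodes/stats/breaker' |
#   jq -r '.nodes[] | "\(.name) \(.breakers.parent.estimated_size_in_bytes) \(.breakers.parent.limit_size_in_bytes)"' |
#   while read -r name est limit; do
#     echo "$name: $(breaker_pct "$est" "$limit")%"
#   done
```

A value that climbs steadily over several days without dropping back would match the behaviour described in this thread.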

leucos · March 21, 2017, 8:18am

I can confirm that upgrading to 5.2.2 solved the problem in our case.

system · April 18, 2017, 8:18am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
