Circuit Breaker Tripping During Shard Relocation

andrewthad · March 27, 2018, 7:31pm

I have a twenty-node cluster with the hot-cold pattern. I'm running Elasticsearch 6.2.3. Here is the breakdown:

12 Hot Nodes: 16GB memory each, 9GB for JVM heap
8 Cold Nodes: 12GB memory each, 9GB for JVM heap

I'm using the default circuit breaker settings. Lately, when Elasticsearch relocates shards to the cold nodes, the circuit breaker has been tripping. This happens constantly. When I check /_cluster/allocation/explain, this is an excerpt of what I see:

$ curl http://192.168.254.240:9200/_cluster/allocation/explain?pretty
{
  "index" : "rec-19-58150",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-03-27T18:23:12.250Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [65Cu-mwsQi6X_Zf3_JPEMw]: failed recovery, failure RecoveryFailedException[[rec-19-58150][0]: Recovery failed from {allsight-node-slow-5}{9BTs_Ix7S82lVIDoNCHDgg}{8XHHmeuhSRyTn2Gzj5g_-A}{192.168.254.224}{192.168.254.224:9300}{speed=slow} into {allsight-node-slow-4}{65Cu-mwsQi6X_Zf3_JPEMw}{uoQ-H5CyQxyTGAQKQjjkTw}{192.168.254.223}{192.168.254.223:9300}{speed=slow}]; nested: RemoteTransportException[[allsight-node-slow-5][192.168.254.224:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [6753420497/6.2gb], which is larger than the limit of [6752370688/6.2gb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "1778qoEdSJeEqC-v5-797w",
      "node_name" : "allsight-node-slow-6",
      "transport_address" : "192.168.254.225:9300",
      "node_attributes" : {
        "speed" : "slow"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-03-27T18:23:12.250Z], failed_attempts[5], delayed=false, details[failed shard on node [65Cu-mwsQi6X_Zf3_JPEMw]: failed recovery, failure RecoveryFailedException[[rec-19-58150][0]: Recovery failed from {allsight-node-slow-5}{9BTs_Ix7S82lVIDoNCHDgg}{8XHHmeuhSRyTn2Gzj5g_-A}{192.168.254.224}{192.168.254.224:9300}{speed=slow} into {allsight-node-slow-4}{65Cu-mwsQi6X_Zf3_JPEMw}{uoQ-H5CyQxyTGAQKQjjkTw}{192.168.254.223}{192.168.254.223:9300}{speed=slow}]; nested: RemoteTransportException[[allsight-node-slow-5][192.168.254.224:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [6753420497/6.2gb], which is larger than the limit of [6752370688/6.2gb]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
...
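
For reference, the two byte counts in the CircuitBreakingException can be compared directly. The sketch below pulls them out of the message and prints how far over the limit the request would have gone, plus the heap size that limit implies. It assumes the default parent breaker limit (indices.breaker.total.limit) of 70% of heap for the 6.x line; the function name and regular expression are just illustrative helpers, not anything from Elasticsearch itself.

```python
import re

# Message text copied from the allocation explain output above.
MESSAGE = (
    "[parent] Data too large, data for [<transport_request>] would be "
    "[6753420497/6.2gb], which is larger than the limit of [6752370688/6.2gb]"
)

# Assumption: default parent breaker limit in 6.x is 70% of the JVM heap.
DEFAULT_PARENT_FRACTION = 0.70


def parse_breaker_message(message):
    """Return (would_be_bytes, limit_bytes) from a breaker exception message."""
    match = re.search(
        r"would be \[(\d+)/[^\]]*\].*limit of \[(\d+)/[^\]]*\]", message
    )
    if match is None:
        raise ValueError("unrecognized circuit breaker message")
    return int(match.group(1)), int(match.group(2))


would_be, limit = parse_breaker_message(MESSAGE)
overage = would_be - limit  # bytes by which the request would exceed the limit
implied_heap = limit / DEFAULT_PARENT_FRACTION  # heap size the limit implies

print(f"over the limit by {overage} bytes ({overage / 2**20:.2f} MiB)")
print(f"heap implied by the limit: {implied_heap / 2**30:.2f} GiB")
```

The overage is only about 1 MiB, and the implied heap is close to the configured 9GB. That suggests the parent breaker's tracked usage on the source node was already sitting near its limit before the recovery request arrived, rather than the recovery request itself needing 6.2gb.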

I don't understand why the circuit breaker would even trip for relocations. It should just be copying segments over the network, which can be done in constant memory. I would appreciate any help resolving this. Let me know if there is more information I need to provide.
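
In case it helps anyone reproduce this, here is a rough sketch of the two calls involved: reading per-node breaker usage from /_nodes/stats/breaker, and the /_cluster/reroute?retry_failed=true retry that the max_retry decider asks for. The host is the one from the curl command above; the helper names are made up for illustration, and the field names (limit_size_in_bytes, estimated_size_in_bytes, tripped) are what I expect in the breaker stats response, so verify them against your version.

```python
import json
import urllib.request

BASE_URL = "http://192.168.254.240:9200"  # host from the curl command above


def breaker_usage(stats):
    """Flatten a /_nodes/stats/breaker response into rows, fullest breaker first.

    Each row is (node_name, breaker_name, used_bytes, limit_bytes, ratio, tripped).
    """
    rows = []
    for node in stats.get("nodes", {}).values():
        for name, breaker in node.get("breakers", {}).items():
            limit = breaker["limit_size_in_bytes"]
            used = breaker["estimated_size_in_bytes"]
            ratio = used / limit if limit > 0 else 0.0
            rows.append(
                (node.get("name", "?"), name, used, limit, ratio,
                 breaker.get("tripped", 0))
            )
    return sorted(rows, key=lambda row: row[4], reverse=True)


def fetch_breaker_stats(base_url=BASE_URL):
    """GET the breaker section of node stats and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/breaker") as resp:
        return json.load(resp)


def retry_failed_allocations(base_url=BASE_URL):
    """POST the retry that the max_retry decider explanation refers to."""
    req = urllib.request.Request(
        f"{base_url}/_cluster/reroute?retry_failed=true", data=b"", method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling breaker_usage(fetch_breaker_stats()) should show which child breaker accounts for most of the parent total on each node. Retrying with retry_failed_allocations() only makes sense once that usage has dropped, since otherwise the recovery will presumably trip the breaker again.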

