RemoteTransportException triggered by parent circuit breaker

bwdezend · July 26, 2017, 7:02pm

The indices.breaker.total.limit setting in Elasticsearch 5.5.0 appears to apply to network transport requests as well, and trips on them. We had set the parent breaker very low (20% of heap) while debugging a different issue, and during the overnight hot->warm shard migration most data nodes reported the following error for every shard:

RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [191b]];
  nested:
    RemoteTransportException[
        [<redacted-nodename>]
        [<redacted-ip-address>:9300]
        [internal:index/shard/recovery/filesInfo]
    ];
  nested:
    CircuitBreakingException[
      [parent] Data too large, data for[<transport_request>] would be [1296430442/1.2gb], which is larger than the limit of [1272787763/1.1gb]
    ]
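To put the numbers in that message in perspective, here is a quick arithmetic sketch in Python. The two byte counts are copied from the exception; the heap size is only inferred, on the assumption that the limit was exactly 20% of the JVM heap as described above.

```python
# Byte counts copied from the CircuitBreakingException above.
would_be = 1296430442  # estimated usage if the transport request were admitted
limit = 1272787763     # parent breaker limit in effect on the node

# How far over the limit the request would have pushed the node.
overshoot = would_be - limit
print(f"overshoot: {overshoot} bytes ({overshoot / 2**20:.1f} MiB)")

# Assumption: the limit is exactly 20% of the JVM heap, so heap = limit / 0.20.
implied_heap = limit / 0.20
print(f"implied heap: {implied_heap / 2**30:.2f} GiB")
```

The request only needed to push the estimate a few tens of MiB past the limit to be rejected, which suggests the node was already sitting close to the 20% mark before the recovery request arrived.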

A number of nodes that reported this error also timed out and left the cluster. When they rejoined, they were able to bring the primary shards back online, but the replica shards had exhausted their index.allocation.max_retries limit, and the cluster stopped retrying those shard allocations. We determined this using the cluster allocation explain endpoint (_cluster/allocation/explain?pretty). Increasing max_retries released all the unassigned shards.
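For anyone hitting the same thing, the recovery steps described above can be sketched roughly as follows in Python. The node address, the index name, and the helper names are placeholders of mine, not anything from our cluster. The last helper uses the reroute API's retry_failed flag, which is an alternative to raising max_retries that we did not use ourselves.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: placeholder node address


def explain_request():
    # With no body, allocation explain reports on an unassigned shard it finds.
    return ("GET", "/_cluster/allocation/explain?pretty", None)


def raise_max_retries_request(index, retries):
    # Raising the per-index limit lets the allocator try the shard again.
    return ("PUT", f"/{index}/_settings", {"index.allocation.max_retries": retries})


def retry_failed_request():
    # Alternative: ask for one more round of attempts on failed allocations.
    return ("POST", "/_cluster/reroute?retry_failed=true", None)


def send(request, base_url=ES_URL):
    """Send one (method, path, body) tuple and return the decoded JSON reply."""
    method, path, body = request
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Usage would look something like `send(explain_request())` followed by `send(raise_max_retries_request("my-index", 10))`, where "my-index" and 10 are made-up values.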

We removed the parent circuit breaker setting (letting the default take over again), but this left me wondering how shard allocation behaves under transient failure conditions. Specifically, how quickly does the cluster retry shard allocations? (Does it wait at all between attempts to assign a shard?) I'd personally prefer to set max_retries to -1, with exponential backoff between assignment retries, but that doesn't appear to be a supported setting.
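To illustrate the kind of behavior I have in mind, here is a minimal Python sketch of exponential backoff driven from outside the cluster. This is not an Elasticsearch feature; the function names, the default delays, and the idea of wrapping a retry call in a script are all my own assumptions.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=8):
    """Yield delays that grow by `factor` each time, never exceeding `cap`."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def retry_with_backoff(action, is_done, sleep=time.sleep, **kwargs):
    """Call `action` until `is_done()` is true, sleeping longer after each miss.

    `action` would be whatever triggers another allocation attempt, and
    `is_done` would check that no shards remain unassigned; both are
    hypothetical callables supplied by the caller.
    """
    for delay in backoff_delays(**kwargs):
        action()
        if is_done():
            return True
        sleep(delay)
    return False
```

With base=1, factor=2 and cap=10, the first five delays are 1, 2, 4, 8 and 10 seconds, so a short transient failure gets retried quickly while a longer outage does not burn through attempts.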

system · August 23, 2017, 7:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
