RemoteTransportException triggered by parent circuit breaker

The indices.breaker.total.limit setting in Elasticsearch 5.5.0 appears to trip on network transport requests as well as on queries. We had set the parent breaker very low (20% of heap) while debugging a different issue, and during the overnight hot->warm shard migration most data nodes reported the following error for every shard:

RecoverFilesRecoveryException[Failed to transfer [1] files with total size of [191b]];
  nested:
    RemoteTransportException[
        [<redacted-nodename>]
        [<redacted-ip-address>:9300]
        [internal:index/shard/recovery/filesInfo]
    ];
  nested:
    CircuitBreakingException[
      [parent] Data too large, data for[<transport_request>] would be [1296430442/1.2gb], which is larger than the limit of [1272787763/1.1gb]
    ]
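To put numbers on that message, here is a minimal Python sketch that pulls the two byte counts out of the exception text and works out how far over the limit the request would have gone. The message string is copied from the trace above; the function name is just for illustration, and the heap estimate assumes the limit is exactly 20% of the heap, as described earlier.

```python
import re

# Message text copied from the CircuitBreakingException above.
MESSAGE = (
    "[parent] Data too large, data for[<transport_request>] would be "
    "[1296430442/1.2gb], which is larger than the limit of [1272787763/1.1gb]"
)


def parse_breaker_message(message):
    """Return (would_be_bytes, limit_bytes) parsed from a breaker message."""
    match = re.search(
        r"would be \[(\d+)/[^\]]*\].*limit of \[(\d+)/[^\]]*\]", message
    )
    if match is None:
        raise ValueError("not a circuit breaker message")
    return int(match.group(1)), int(match.group(2))


would_be, limit = parse_breaker_message(MESSAGE)
overshoot = would_be - limit

# Assumption: the limit is exactly 20% of the JVM heap.
implied_heap = limit / 0.20

print(f"overshoot: {overshoot} bytes ({overshoot / 2**20:.1f} MiB)")
print(f"implied heap: {implied_heap / 2**30:.2f} GiB")
```

The file being transferred was only 191 bytes, yet the projected total is well over the limit. That suggests the usage already tracked on the node was at or above the limit before this request arrived, so any transport request at all would have tripped the parent breaker.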

A number of the nodes that reported this error also timed out and left the cluster. When they rejoined, they brought the primary shards back online, but the replica shards had exhausted their index.allocation.max_retries limit, so the cluster stopped retrying those allocations. We determined this using the cluster allocation explain endpoint (_cluster/allocation/explain?pretty). Increasing max_retries released all of the unassigned shards.
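For anyone hitting the same thing, the recovery steps can be sketched roughly as follows in Python. This is an illustration rather than exactly what we ran: the base URL, the index name and the retry count are placeholders, and the helper names are made up. The reroute call with retry_failed is an alternative to raising max_retries that, as far as I know, is available in 5.x.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: adjust for your cluster


def explain_request():
    """Ask the cluster why a shard is unassigned (no body: picks one itself)."""
    return ("GET", "/_cluster/allocation/explain?pretty", None)


def raise_max_retries_request(index, retries):
    """Raise the per-index allocation retry limit."""
    return ("PUT", f"/{index}/_settings", {"index.allocation.max_retries": retries})


def retry_failed_request():
    """Ask the cluster to retry shards that ran out of allocation retries."""
    return ("POST", "/_cluster/reroute?retry_failed=true", None)


def send(request, base_url=BASE_URL):
    """Send a (method, path, body) tuple and return the decoded JSON reply."""
    method, path, body = request
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # "my-index" and 10 are hypothetical values for illustration only.
    print(send(explain_request()))
    print(send(raise_max_retries_request("my-index", 10)))
```

The request builders are kept separate from send() so they can be inspected or logged before anything touches the cluster.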

We removed the parent circuit breaker setting (letting the default take over again), but this left me wondering about shard allocation under transient failure conditions. Specifically, how quickly does the cluster retry a failed shard allocation? Does it wait at all between attempts to assign a shard? I would personally prefer to set max_retries to -1 and have exponential backoff between assignment retries, but that doesn't appear to be a supported configuration.
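To make the backoff idea concrete, here is a rough Python sketch of what I have in mind, done from the client side. It is not an Elasticsearch feature. The action and is_done callables are placeholders: action could issue a reroute with retry_failed, and is_done could check whether any shards are still unassigned. The base delay, factor and cap are arbitrary values.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=8):
    """Return a list of delays that grow by `factor` and never exceed `cap`."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays


def retry_with_backoff(action, is_done, delays, sleep=time.sleep):
    """Call action() until is_done() is true, sleeping between attempts.

    Returns the number of attempts made, or None if every delay was used
    up without success.
    """
    for attempt, delay in enumerate(delays, start=1):
        action()
        if is_done():
            return attempt
        sleep(delay)
    return None


if __name__ == "__main__":
    # Dummy callables for illustration: succeed on the third attempt.
    state = {"calls": 0}

    def action():
        state["calls"] += 1

    def is_done():
        return state["calls"] >= 3

    delays = backoff_delays(base=0.1, cap=1.0, attempts=5)
    print(retry_with_backoff(action, is_done, delays))
```

Capping the delay keeps the retry interval bounded during a long outage, while the early short delays still recover quickly from a brief blip like the one described above.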
