The indices.breaker.total.limit setting in Elasticsearch 5.5.0 appears to trip on network transport requests as well. We had set the parent breaker very low (20% of heap) while trying to debug a different issue, and during the overnight hot->warm shard migration most data nodes reported the following error for every shard:
RecoverFilesRecoveryException[Failed to transfer  files with total size of [191b]]; nested: RemoteTransportException[ [<redacted-nodename>] [<redacted-ip-address>:9300] [internal:index/shard/recovery/filesInfo] ]; nested: CircuitBreakingException[ [parent] Data too large, data for[<transport_request>] would be [1296430442/1.2gb], which is larger than the limit of [1272787763/1.1gb] ]
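For reference, a parent-breaker override like the one described above would have been applied through the cluster settings API. This is a sketch, not our exact request; the 20% value is the one mentioned above:

```
PUT _cluster/settings
{
  "transient": {
    "indices.breaker.total.limit": "20%"
  }
}
```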
A number of nodes that reported this error also timed out and left the cluster. When they rejoined, they were able to bring the primary shards back online, but the replica shards had exceeded their index.allocation.max_retries counter, and the cluster stopped retrying those shard allocations. We determined this via the cluster allocation explain endpoint (_cluster/allocation/explain?pretty). Increasing the max_retries counter released all of the unassigned shards.
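Concretely, the diagnosis-and-recovery sequence looked roughly like this (the endpoint paths are real 5.x APIs; the index name and retry value here are illustrative, not our exact ones):

```
GET _cluster/allocation/explain?pretty

PUT /<index>/_settings
{
  "index.allocation.max_retries": 10
}
```

Alternatively, POST _cluster/reroute?retry_failed=true retries allocations that have hit the max_retries limit without changing the setting itself.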
We removed the parent circuit breaker setting (allowing the defaults to take over again), but this left me wondering about shard allocation behavior under transient failure conditions. Specifically, how quickly does the cluster retry shard allocations? (Does it wait at all between attempts to assign a shard?) I'd personally prefer to set the max_retries counter to -1, with an exponential backoff between assignment retries, but that doesn't appear to be a supported setting.
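For completeness, clearing the breaker override (as we did) can be done by setting it to null, which restores the default limit (70% of the heap in 5.x, if I recall correctly):

```
PUT _cluster/settings
{
  "transient": {
    "indices.breaker.total.limit": null
  }
}
```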