After moving one of our clusters from ES 6.4 to 7.5, we have been seeing frequent instances of shards failing to allocate because they hit the maximum of 5 retries.
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-01-19T04:02:46.344Z], failed_attempts[5]...
The reason given for the failure generally ends up being circuit-breaker related (since the change that made the parent circuit breaker account for real memory usage, we have been making adjustments, but breakers are still tripped occasionally).
The circuit breakers are a separate issue; my question for now is: is there something I can do to remove the need for manual intervention (calling /_cluster/reroute?retry_failed=true) here?
I appreciate that the cluster is trying to protect itself to maintain stability, which is why I have been trying to avoid turning off the new parent circuit breaker logic. However, there is not much the cluster could do that would be worse than silently and permanently giving up on recovering a replica; that sets us up to lose data if nobody is paying close attention and available to force the retry.
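To be concrete, the "new parent circuit breaker logic" I have been avoiding turning off is, as I understand it, the real-memory accounting controlled by this static setting in elasticsearch.yml (default true in 7.x):

# turning this off would revert the parent breaker to the pre-7.0 heuristic-only accounting
indices.breaker.total.use_real_memory: false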
Increasing the retry count does not seem like a real fix. I do not know how quickly the retries happen, but whenever I see this issue it generally affects several shards at once. I would much prefer to be able to tell the cluster to wait a while and then retry, or something along those lines.
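For what it's worth, the retry count in question is, as far as I can tell, the dynamic index setting index.allocation.max_retries (default 5), so raising it would presumably look something like this:

# placeholders: localhost:9200 for the node, my-index-* for the affected indices
curl -X PUT "localhost:9200/my-index-*/_settings" -H 'Content-Type: application/json' -d'
{
  "index.allocation.max_retries": 10
}
'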
Edit: I should note that calling "/_cluster/reroute?retry_failed=true" always seems to do the trick on the first try.
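In case it helps, the manual step is literally just this one command each time:

# localhost:9200 is a placeholder for whichever node we hit
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"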