Hey guys,
We have a cluster running on version 6.5.4.
When we look at the unassigned shards, we see the following for one of them, for example:
{
"index" : "facebook-post-comment_v1",
"shard" : 3,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2019-02-25T11:04:27.016Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [VR2ChTZKSWu1Do1NFQawWQ]: failed recovery, failure RecoveryFailedException[[facebook-post-comment_v1][3]: Recovery failed from {es73}{rzne2XGKQ0CCk_1I3ZGOUg}{mXiPfm07TWGgLlft4XYx0w}{192.168.1.73}{192.168.1.73:9300}{xpack.installed=true} into {es84}{VR2ChTZKSWu1Do1NFQawWQ}{qvBmNF6vRCOckUHZLt-Mxw}{192.168.1.84}{192.168.1.84:9300}{xpack.installed=true}]; nested: RemoteTransportException[[es73][192.168.1.73:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[es84][192.168.1.84:9300][internal:index/shard/recovery/translog_ops]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [5965608074/5.5gb], which is larger than the limit of [5964143001/5.5gb], usages [request=0/0b, fielddata=5069497235/4.7gb, in_flight_requests=164784/160.9kb, accounting=895946055/854.4mb]]; ",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
...
}
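As a sanity check on the numbers in the "details" field above, here is a small sketch that pulls the byte counts out of the CircuitBreakingException text and adds up the usages. The function name parse_breaker_message is only made up for this example and the regular expressions assume the message format shown above, so treat it as a rough illustration rather than anything official.

```python
import re

# Breaker message copied from the "details" field above.
message = (
    "CircuitBreakingException[[parent] Data too large, data for [<transport_request>] "
    "would be [5965608074/5.5gb], which is larger than the limit of [5964143001/5.5gb], "
    "usages [request=0/0b, fielddata=5069497235/4.7gb, "
    "in_flight_requests=164784/160.9kb, accounting=895946055/854.4mb]]"
)


def parse_breaker_message(msg):
    """Return (would_be, limit, usages) in bytes from a parent breaker message.

    Hypothetical helper: it assumes the exact wording of the message above.
    """
    would_be = int(re.search(r"would be \[(\d+)/", msg).group(1))
    limit = int(re.search(r"limit of \[(\d+)/", msg).group(1))
    usages_text = re.search(r"usages \[(.*?)\]\]", msg).group(1)
    usages = {name: int(size) for name, size in re.findall(r"(\w+)=(\d+)/", usages_text)}
    return would_be, limit, usages


would_be, limit, usages = parse_breaker_message(message)
total = sum(usages.values())          # sum of the listed usages
largest = max(usages, key=usages.get)  # usage entry with the most bytes

print(f"sum of usages: {total} (reported: {would_be}, limit: {limit})")
print(f"over the limit by {would_be - limit} bytes; largest consumer: {largest}")
```

With the values from this message, the listed usages add up to exactly the reported "would be" figure, and fielddata accounts for most of it, which is why clearing the cache looked like the right fix.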
As far as I understand, it says that it failed to allocate 5 times on Feb 25. So I thought I needed to fix the underlying problem and then retry the allocation. The fix was to clear the cache to get rid of the CircuitBreakingException caused by exceeding the breaker limit. I did that on all nodes, and there are no more CircuitBreakingExceptions on any of them. When I check the breakers, the parent breaker usage on every node is below 1.2GB out of the 5.5GB limit.
Then I ran the allocation command manually with retry_failed=true, but it returned the following.
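For anyone who wants to repeat the breaker check, this is roughly how the parent breaker usage can be summarized per node from the output of GET /_nodes/stats/breaker. It is only a sketch: the helper name parent_breaker_usage is made up, and the sample dict below is not real output from our cluster. It just mimics the shape of the response, with the limit taken from the exception and an illustrative estimated size.

```python
def parent_breaker_usage(stats):
    """Map node name -> (estimated_bytes, limit_bytes, percent_used) for the parent breaker.

    `stats` is expected to be the parsed JSON body of GET /_nodes/stats/breaker.
    """
    report = {}
    for node in stats["nodes"].values():
        parent = node["breakers"]["parent"]
        used = parent["estimated_size_in_bytes"]
        limit = parent["limit_size_in_bytes"]
        report[node["name"]] = (used, limit, round(100.0 * used / limit, 1))
    return report


# Made-up sample in the shape of the node stats response (NOT real cluster output).
# The limit is the one from the exception; the estimated size is illustrative only.
sample = {
    "nodes": {
        "VR2ChTZKSWu1Do1NFQawWQ": {
            "name": "es84",
            "breakers": {
                "parent": {
                    "limit_size_in_bytes": 5964143001,
                    "estimated_size_in_bytes": 1200000000,
                }
            },
        }
    }
}

report = parent_breaker_usage(sample)
for name, (used, limit, pct) in sorted(report.items()):
    print(f"{name}: {used} of {limit} bytes ({pct}%)")
```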
POST /_cluster/reroute?retry_failed=true
{
  "commands" : [
    {
      "allocate_replica" : {
        "index" : "facebook-post-comment_v1",
        "shard" : 3,
        "node" : "es84"
      }
    }
  ]
}
The response is:
{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[es74][192.168.1.74:9300][cluster:admin/reroute]"
}
],
"type": "illegal_argument_exception",
"reason": "[allocate_replica] allocation of [facebook-post-comment_v1][3] on node {es84}{VR2ChTZKSWu1Do1NFQawWQ}{ZJhk2hS2QGCl8j0bp6BRFw}{192.168.1.84}{192.168.1.84:9300}{xpack.installed=true} is not allowed, reason: [NO(shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-02-25T11:04:27.016Z], failed_attempts[5], delayed=false, details[failed shard on node [VR2ChTZKSWu1Do1NFQawWQ]: failed recovery, failure RecoveryFailedException[[facebook-post-comment_v1][3]: Recovery failed from {es73}{rzne2XGKQ0CCk_1I3ZGOUg}{mXiPfm07TWGgLlft4XYx0w}{192.168.1.73}{192.168.1.73:9300}{xpack.installed=true} into {es84}{VR2ChTZKSWu1Do1NFQawWQ}{qvBmNF6vRCOckUHZLt-Mxw}{192.168.1.84}{192.168.1.84:9300}{xpack.installed=true}]; nested: RemoteTransportException[[es73][192.168.1.73:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[es84][192.168.1.84:9300][internal:index/shard/recovery/translog_ops]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [5965608074/5.5gb], which is larger than the limit of [5964143001/5.5gb], usages [request=0/0b, fielddata=5069497235/4.7gb, in_flight_requests=164784/160.9kb, accounting=895946055/854.4mb]]; ], allocation_status[no_attempt]]])]..."
},
"status": 400
}
It says no for the max-retry condition, but shouldn't it be yes, since it has already reached the retry limit? Also, the es75 node already has the shard data.