Hi,
I’m using Elasticsearch 5.4 and have a cluster initialization problem:
the shards have not managed to finish their initialization for more than 3 days.
I keep getting the following trace messages in the logs for ALL unassigned shards:
[2019-03-28T00:05:51,180][TRACE][o.e.c.r.a.d.AllocationDeciders] [esmaster2] Can not allocate [[test-index-20190217][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2019-03-25T14:31:27.494Z], delayed=false, allocation_status[no_attempt]]] on node [{es1}{iNzYU8rGTLeSFwRuyX_SSg}{EBkmEu77RlGZldrXjRFK5g}{192.168.0.30}{192.168.0.30:9300}{rack=1}] due to [SameShardAllocationDecider]
[2019-03-28T00:05:51,180][TRACE][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [esmaster2] [test-index-20190217][2], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2019-03-25T14:31:27.494Z], delayed=false, allocation_status[no_attempt]]: ignoring allocation, can't be allocated on any node
Restarting the cluster several times did not help;
I kept getting the same log, and the Allocation explain API kept returning the following:
- same_shard decider
- 'reached the limit of incoming shard recoveries [100]'
- cluster_rebalance
- 'shard is in the process of initializing on node', even though that initialization started 3 days ago
In the output of the Recovery API and the Index recovery API, I see that some of the unassigned shards are missing entirely, and others have been stuck in the init/index stage for 3 days (a sketch of the explain/recovery calls I ran follows).
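For reference, this is roughly how I queried the Allocation explain API and the recovery status for one of the stuck shards (a minimal sketch using the 5.4 low-level REST client; the host/port are placeholders, the index and shard number are taken from the trace above):

import java.util.Collections;
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class StuckShardDiagnostics {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; point this at one of the nodes
        try (RestClient client = RestClient.builder(new HttpHost("192.168.0.30", 9200, "http")).build()) {
            // Ask the Allocation explain API about one of the unassigned replicas from the trace
            String explainBody = "{\"index\": \"test-index-20190217\", \"shard\": 2, \"primary\": false}";
            Response explain = client.performRequest("GET", "/_cluster/allocation/explain",
                    Collections.emptyMap(),
                    new NStringEntity(explainBody, ContentType.APPLICATION_JSON));
            System.out.println(EntityUtils.toString(explain.getEntity()));

            // Index recovery API for the same index: some shards are missing from the output,
            // others have been sitting in the init/index stage for days
            Response recovery = client.performRequest("GET", "/test-index-20190217/_recovery");
            System.out.println(EntityUtils.toString(recovery.getEntity()));
        }
    }
}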
Other steps I tried in order to fix the cluster health, all with no success:
- To force the cluster to rebuild the replicas, I lowered the number of replicas from 1 to 0, and then (once the remaining shards were allocated properly) back to 1, but I still got the same log as above while the replicas were being added (the settings calls I used are sketched after this list).
- Deleting and recreating the replicas does not help either, since they get stuck behind the old running/stuck recovery tasks.
- Cancelling those cluster tasks did not help, since they are not cancellable.
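For completeness, this is roughly how I dropped and restored the replicas (again a minimal sketch with the 5.4 low-level REST client; the host/port are placeholders and I used the index from the trace above as an example):

import java.util.Collections;
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ReplicaToggle {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("192.168.0.30", 9200, "http")).build()) {
            // Drop the replicas so only primaries remain
            putSettings(client, "/test-index-20190217/_settings", "{\"index\": {\"number_of_replicas\": 0}}");

            // ... wait here until all primaries are allocated and the cluster is green ...

            // Restore the replicas; this is the point where the SameShardAllocationDecider trace reappears
            putSettings(client, "/test-index-20190217/_settings", "{\"index\": {\"number_of_replicas\": 1}}");
        }
    }

    private static void putSettings(RestClient client, String endpoint, String body) throws Exception {
        Response response = client.performRequest("PUT", endpoint, Collections.emptyMap(),
                new NStringEntity(body, ContentType.APPLICATION_JSON));
        System.out.println(endpoint + " -> " + response.getStatusLine());
    }
}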
I drilled down into the code that writes this log, and I don’t understand why it happens here:
in our case the cluster.routing.allocation.same_shard.host setting is defined as false, so this check should not be needed and should not fail on this issue.
SameShardAllocationDecider:
The OR condition on line 74 seems like a bug:
if (decision.type() == Decision.Type.NO || sameHost == false)
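To double-check my reading of that condition, I reproduced its short-circuit behaviour in isolation (a standalone sketch, not Elasticsearch code; decisionIsNo stands for "the node-level same-shard check returned NO" and sameHost for the cluster.routing.allocation.same_shard.host setting):

public class SameShardConditionDemo {
    // Mirrors the line-74 condition: if (decision.type() == Decision.Type.NO || sameHost == false) return decision;
    static String outcome(boolean decisionIsNo, boolean sameHost) {
        if (decisionIsNo || sameHost == false) {
            // The decider returns the node-level decision here, whatever it was
            return decisionIsNo ? "return the node-level NO decision" : "return the node-level decision (not NO)";
        }
        return "continue to the host-level check";
    }

    public static void main(String[] args) {
        for (boolean decisionIsNo : new boolean[]{false, true}) {
            for (boolean sameHost : new boolean[]{false, true}) {
                System.out.printf("decisionIsNo=%-5s sameHost=%-5s -> %s%n",
                        decisionIsNo, sameHost, outcome(decisionIsNo, sameHost));
            }
        }
    }
}

With sameHost == false the early return always fires, so the decider simply passes the node-level decision through.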