My cluster has 2 nodes, both running in Docker on different VMs in the same network.
The cluster health turns yellow after a few hours: replica shards start to become unassigned one by one, until after a day all the replica shards are unassigned. When I check the shard allocation, it looks like this:
So I call the following command: POST /_cluster/reroute?retry_failed=true
Immediately the shards start to initialize:
After about 3-4 minutes, all shards appear to be assigned and the cluster health is green:
To find out why this keeps happening, I started using the allocation explain API: GET /_cluster/allocation/explain?pretty
And I got:
{
  "index" : "projects",
  "shard" : 4,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "MANUAL_ALLOCATION",
    "at" : "2020-07-21T08:22:48.307Z",
    "details" : "failed shard on node [Vnl1IdQOTdGDZcr0qG1Wxw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[projects][4]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "awaiting_info",
  "allocate_explanation" : "cannot allocate because information about existing shard data is still being retrieved from some of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "Vnl1IdQOTdGDZcr0qG1Wxw",
      "node_name" : "eu01",
      "transport_address" : "172.18.4.6:9300",
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "gI3ylY0JTNWuCSOSJ1vN2g",
      "node_name" : "us01",
      "transport_address" : "172.18.1.11:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[projects][4], node[gI3ylY0JTNWuCSOSJ1vN2g], [P], s[STARTED], a[id=X-D0rlNmRmuTSkWlR3AQ7w]]"
        },
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [gI3ylY0JTNWuCSOSJ1vN2g] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    }
  ]
}
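To make the per-node decisions in that response easier to compare, it can be condensed with a small script. This is only a rough sketch in Python: the function name summarize_decisions and the SAMPLE dict are illustrative assumptions, and SAMPLE is a hand-trimmed copy of the response above (explanations omitted), not something fetched from the cluster.

```python
def summarize_decisions(explain):
    """Return one (node_name, node_decision, deciders) tuple per node.

    `explain` is assumed to be the parsed JSON body returned by
    GET /_cluster/allocation/explain.
    """
    rows = []
    for node in explain.get("node_allocation_decisions", []):
        deciders = [
            "{}={}".format(d["decider"], d["decision"])
            for d in node.get("deciders", [])
        ]
        rows.append((node["node_name"], node["node_decision"], deciders))
    return rows


# Hand-trimmed copy of the response shown above (hypothetical variable name).
SAMPLE = {
    "index": "projects",
    "shard": 4,
    "node_allocation_decisions": [
        {
            "node_name": "eu01",
            "node_decision": "throttled",
            "deciders": [{"decider": "throttling", "decision": "THROTTLE"}],
        },
        {
            "node_name": "us01",
            "node_decision": "no",
            "deciders": [
                {"decider": "same_shard", "decision": "NO"},
                {"decider": "throttling", "decision": "THROTTLE"},
            ],
        },
    ],
}

for name, decision, deciders in summarize_decisions(SAMPLE):
    print("{}: {} ({})".format(name, decision, ", ".join(deciders)))
# eu01: throttled (throttling=THROTTLE)
# us01: no (same_shard=NO, throttling=THROTTLE)
```

Read this way, eu01 is the only node that could take the replica and it is only throttled, while us01 is ruled out because it already holds the primary of [projects][4].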
I checked my disk space; 90% of it is free, so lack of disk space is not the cause here.
Can someone help me understand what the issue is and why the shards become unassigned every day?
Thanks