We just rebuilt our cluster and are running into a problem we have not had before: almost constant shard relocation. Every couple of hours this ends with a shard or two stuck unassigned on one or two nodes. Running _cluster/allocation/explain?pretty, we get:
{
  "index": "blah",
  "shard": 71,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2020-06-12T19:23:11.894Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [-zoTGYAhSuOvdfXm9WnAdw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[foo_index][71]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status": "no_attempt"
  },
When we see this, running the following always clears it:
POST /_cluster/reroute?retry_failed=true
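For reference, this is how we spot the stuck shards and the recorded reason before rerouting; the column selection below is just what we guessed would be useful, nothing special:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason,unassigned.at,node&s=state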
If we leave it long enough, however, we reach a state where every copy of a shard is offline and the cluster goes red.
What might cause this constant churn of shards moving around?
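In case it helps, this is roughly how we have been watching the movement, using only the standard APIs; the second call is just to dump the current cluster.routing.* allocation and rebalance settings:

GET _cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent

GET _cluster/settings?include_defaults=true&flat_settings=true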
We did have the node_left delay (index.unassigned.node_left.delayed_timeout) set to 5m, and saw this once:
"allocate_explanation": "cannot allocate because the cluster is still waiting 2.9m for the departed node holding a replica to rejoin, despite being allowed to allocate the shard to at least one other node",
So I changed that to 2m and bumped the allocation retries to 10, but we are still hitting the same error.
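For completeness, the changes were applied roughly like this (I'm assuming index.allocation.max_retries is the right knob for the retry count, so please correct me if not):

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "2m",
    "index.allocation.max_retries": 10
  }
}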
Looking for ideas on where to start digging.
Thanks