Failed to create shard, failure IOException[failed to obtain in-memory shard lock]

jthoni · June 13, 2020, 10:22am

We just rebuilt our cluster and are running into a problem we have not had before: shards are relocating almost constantly. During this process, every couple of hours, we end up with a shard or two stuck in Uninitialized. Running GET _cluster/allocation/explain?pretty returns:

{
  "index": "blah",
  "shard": 71,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2020-06-12T19:23:11.894Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [-zoTGYAhSuOvdfXm9WnAdw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[foo_index][71]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status": "no_attempt"
  },
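
For anyone collecting many of these explain responses, here is a rough Python sketch that pulls the node ID, index, shard number and lock timeout out of the "details" string, so failures can be grouped by node. The regex is only matched against the message shown above; the function name is hypothetical and the wording of the message may differ in other Elasticsearch versions.

```python
import re

# Pattern derived only from the "details" string quoted in this post.
DETAILS_RE = re.compile(
    r"failed shard on node \[(?P<node>[^\]]+)\].*?"
    r"ShardLockObtainFailedException\[\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]: "
    r"obtaining shard lock timed out after (?P<timeout_ms>\d+)ms\]"
)


def parse_shard_lock_failure(details):
    """Return node, index, shard and timeout from a details string, or None."""
    match = DETAILS_RE.search(details)
    if match is None:
        return None
    return {
        "node": match["node"],
        "index": match["index"],
        "shard": int(match["shard"]),
        "timeout_ms": int(match["timeout_ms"]),
    }


details = (
    "failed shard on node [-zoTGYAhSuOvdfXm9WnAdw]: failed to create shard, "
    "failure IOException[failed to obtain in-memory shard lock]; nested: "
    "ShardLockObtainFailedException[[foo_index][71]: obtaining shard lock "
    "timed out after 5000ms]; "
)
print(parse_shard_lock_failure(details))
```

If the same node ID keeps showing up across failures, that node is a good place to start looking.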

When we see this, retrying the failed allocations always works:

POST /_cluster/reroute?retry_failed=true
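
The same call can be scripted. Below is a minimal Python sketch using only the standard library; it builds the request without sending it. The address http://localhost:9200 is an assumption, and a real cluster will likely need authentication and TLS on top of this.

```python
from urllib.request import Request


def build_retry_failed_request(base_url):
    """Build (but do not send) the POST that retries failed shard allocations."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    return Request(url, method="POST")


# Assumed address; replace with a real node in the cluster.
request = build_retry_failed_request("http://localhost:9200")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Note that this only resets the failed-allocation counter so the master tries again; it does not address whatever made the allocation fail in the first place.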

If we leave it long enough, however, all replicas for a shard end up offline and the cluster goes Red.

What might cause this constant churn of things moving around?

We did have index.unassigned.node_left.delayed_timeout set to 5m, and saw this once:

"allocate_explanation": "cannot allocate because the cluster is still waiting 2.9m for the departed node holding a replica to rejoin, despite being allowed to allocate the shard to at least one other node",

So I changed that to 2m and bumped the allocation retries to 10, but we are still hitting the same error.
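
For reference, a sketch of what that settings change could look like as a request, in Python with the standard library only. It assumes that "retries" refers to index.allocation.max_retries and that the change is applied to all indices via _all; the address is also an assumption. The request is built but not sent.

```python
import json
from urllib.request import Request


def build_settings_request(base_url, index_pattern, delayed_timeout, max_retries):
    """Build (but do not send) a PUT that updates two dynamic index settings."""
    body = {
        "settings": {
            # How long to wait for a departed node before reallocating replicas.
            "index.unassigned.node_left.delayed_timeout": delayed_timeout,
            # Assumption: this is the "retries" setting mentioned in the post.
            "index.allocation.max_retries": max_retries,
        }
    }
    url = f"{base_url.rstrip('/')}/{index_pattern}/_settings"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed address and index pattern.
request = build_settings_request("http://localhost:9200", "_all", "2m", 10)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```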

Any ideas on where to start looking?

Thanks

jthoni · June 13, 2020, 4:38pm

Actually, I am not sure about the order of operations here. Looking at some telemetry, it seems the Uninitialized state is what starts it.

In this chart:
Red --> Uninitialized
Orange --> Initializing
Yellow --> Relocating

It looks like the shards first go Uninitialized, then Initializing, then Relocating. I guess that makes sense, but I am not sure why the initial Uninitialized happened.

The large spike at the right happened and resolved with no shards getting stuck in Uninitialized (i.e. all green after that).

DavidTurner · June 13, 2020, 5:35pm

Are nodes leaving the cluster and then immediately rejoining? Look for messages from the MasterService (on the elected master) about that. That would explain shards suddenly becoming uninitialized, and also the "failed to obtain in-memory shard lock" message.
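
A rough Python sketch of that log check follows: it scans the elected master's log for a node-left followed shortly by a node-join for the same node. The log line layout, the node names and the 60-second window are all assumptions for illustration (the sample lines are invented, not taken from this cluster), and the exact format varies between Elasticsearch versions.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: "[timestamp][LEVEL][...MasterService...] [master] node-left[{name}..."
EVENT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*MasterService.*?"
    r"(?P<event>node-left|node-join)\[\{(?P<node>[^}]+)\}"
)


def find_quick_rejoins(lines, window=timedelta(seconds=60)):
    """Return (node, gap_seconds) for nodes that rejoined within `window` of leaving."""
    last_left = {}
    rejoins = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S,%f")
        node = match["node"]
        if match["event"] == "node-left":
            last_left[node] = ts
        elif node in last_left:
            gap = ts - last_left.pop(node)
            if gap <= window:
                rejoins.append((node, gap.total_seconds()))
    return rejoins


# Invented sample lines, for illustration only.
sample_log = [
    "[2020-06-12T19:23:05,101][INFO ][o.e.c.s.MasterService    ] [master-1] "
    "node-left[{data-3}{abc} reason: disconnected], term: 4, version: 812",
    "[2020-06-12T19:23:09,442][INFO ][o.e.c.s.MasterService    ] [master-1] "
    "node-join[{data-3}{abc} join existing leader], term: 4, version: 813",
    "[2020-06-12T19:30:00,000][INFO ][o.e.c.s.MasterService    ] [master-1] "
    "node-left[{data-7}{def} reason: disconnected], term: 4, version: 820",
    "[2020-06-12T19:40:00,000][INFO ][o.e.c.s.MasterService    ] [master-1] "
    "node-join[{data-7}{def} join existing leader], term: 4, version: 821",
]
for node, gap_seconds in find_quick_rejoins(sample_log):
    print(f"{node} rejoined {gap_seconds:.1f}s after leaving")
```

A node that rejoins within seconds of leaving may still be holding the shard lock from its previous stint in the cluster, which would fit the 5000ms lock timeout in the error.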

jthoni · June 14, 2020, 4:29am

I currently can't get to the logs on the local node. In that case, would it help to set index.unassigned.node_left.delayed_timeout to something really small? We usually run with a longer delay (5m, which I changed to 2m) to deal with node reboots after patches, but I don't think that is what is happening here.

DavidTurner · June 14, 2020, 7:50am

Not really. If your nodes aren't staying in the cluster then all sorts of other things will be behaving badly too. The fix is to keep the nodes in the cluster.

Troubleshooting this without logs is pretty much impossible, so getting access to them is the first priority.

