Elasticsearch does not remove shards on closing due to locking issues

Hey,

we have an issue with our 7.17.18 cluster. It seems that closing a shard runs into problems. This happens on several data nodes, so it is not a single glitch, and it never happened with our previous 7.12 version before the upgrade. When data is moved around (shard allocation via node tags is in place; the setup is sketched at the end of this post), at some point a shard that is closing keeps holding its lock, and the new copy cannot obtain it:

[2024-06-07T23:22:13,162][WARN ][o.e.i.c.IndicesClusterStateService] [HOSTNAME] [INDEX][6] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:517) ~[elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:848) ~[elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:176) ~[elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:577) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:554) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:553) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:539) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:503) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:428) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:56) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262) [elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225) [elasticsearch-7.17.18.jar:7.17.18]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [INDEX][6]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [417105173ms]
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:843) ~[elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:743) ~[elasticsearch-7.17.18.jar:7.17.18]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:420) ~[elasticsearch-7.17.18.jar:7.17.18]
        ... 17 more

The interesting part is the last message stating lock already held for [closing shard] with an age of [417105173ms], which is roughly 4.8 days - so it seems the shard started closing about 4.8 days ago and has been holding the lock ever since.

Checking the disks of these nodes, they still contain the data of the old shard, so it is not removed properly. As this has happened on several nodes at once, we are at risk of running out of disk space in our cluster.
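
For reference, this is roughly how we check the disk usage and the leftover shard copies; the cat APIs are standard, and the shell line assumes the default packaged data directory on the node:

GET _cat/allocation?v

GET _cat/shards/INDEX?v&h=index,shard,prirep,state,store,node

# on the affected data node itself, the stale shard directory is still present
du -sh /var/lib/elasticsearch/nodes/0/indices/*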

There is one other log message, logged before such an exception, that might help:

[2024-06-06T03:23:19,868][INFO ][o.e.i.s.IndexShard       ] [NODE] [INDEX][5] primary-replica resync completed with 0 operations

There is no other notable logging during or before the first occurrence of that issue on the data node. The first time it happens, the lock age still looks reasonable:

obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [5214ms]

Looking at the master node around the first occurrence of the error, there are no specific error messages, except some notices that a node was running low on disk space and allocation to it was therefore prevented.
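
For completeness, the disk watermark settings involved can be checked like this; we have not changed them, so to the best of my knowledge the values below are the defaults:

GET _cluster/settings?include_defaults=true&flat_settings=true

# cluster.routing.allocation.disk.watermark.low: "85%"
# cluster.routing.allocation.disk.watermark.high: "90%"
# cluster.routing.allocation.disk.watermark.flood_stage: "95%"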

Is there anything else we can provide to dive deeper into this and help with debugging?
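
For reference, the node-tag based allocation mentioned above is set up roughly like this (the attribute name "zone" is just a stand-in for our real tag):

# elasticsearch.yml on the data nodes
node.attr.zone: hot

# index-level filter that triggers the shard movement
PUT INDEX/_settings
{
  "index.routing.allocation.require.zone": "warm"
}

# the current tags can be inspected with
GET _cat/nodeattrs?v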

Thanks and have a nice day!

--Alex

The time here is a bit misleading: it is the time since the shard lock was originally obtained when the shard was started, so ages of several days are not a problem here.

I'm surprised you didn't see this in 7.12 tho; this problem with slow shard shutdowns has been around for a very long time. The main culprit seems to be Merges sometimes do lots of work even after being aborted · Issue #13354 · apache/lucene · GitHub. It will go away, eventually, after enough retries tho. This area has been reworked in 8.x to be more robust to this problem, but unfortunately not in a way that can reasonably be backported to 7.17, so the best advice here is to upgrade.
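
To see where the retries stand, something along these lines should work (using shard 6 of your INDEX from the log above, and assuming it's the primary copy that failed - flip primary to false for a replica; index.allocation.max_retries defaults to 5):

GET _cluster/allocation/explain
{
  "index": "INDEX",
  "shard": 6,
  "primary": true
}

# once the closing shard has released its lock, allocations that hit the retry limit can be retried with
POST _cluster/reroute?retry_failed=true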

Thanks for the fast response, David! It indeed makes sense that merging is blocking this, as this is a high-write index. That long duration also makes sense then, but once the exception starts, it keeps occurring in our logs for days, it seems.
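
For what it's worth, I'll keep an eye on the merge activity of that index with something like this (assuming the per-shard index stats are the right place to look):

GET INDEX/_stats/merge?level=shards

# merges.current > 0 on a shard should mean a merge is still in flight there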

It will go away, eventually, after enough retries tho.

This issue went on for days on some nodes (until we figured out that the system could not write anymore and only a restart helped). Is there anything I can do to figure out whether it will go away? What are the criteria for that?

If it happens again, I'll take a look at the hot_threads output.
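
Something along these lines is what I have prepared for the next occurrence; the parameters are just my guess at catching the merge threads as well:

GET _nodes/NODE/hot_threads?threads=9999&ignore_idle_threads=false

# if I read the output right, merge work shows up in threads named "Lucene Merge Thread #N"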

Luckily we're already preparing the 8.x upgrade, but that may take us another couple of weeks.

I assume another workaround would be to use bigger disks on these systems, as that would lower the likelihood of shards being shuffled around...

When the closing shard finally closes, it'll release the shard lock so that it can be acquired by the new copy of the shard.