Hey,
we have an issue with our 7.17.18 cluster: closing a shard seems to get stuck. This happens on several data nodes, so it is not a single glitch, and it never happened on our previous 7.12 version before the upgrade. When data is moved around (shard allocation via node tags is in place; a sketch of that setup is further below), at some point a shard fails to be created because the in-memory shard lock is still held by an earlier shard close:
[2024-06-07T23:22:13,162][WARN ][o.e.i.c.IndicesClusterStateService] [HOSTNAME] [INDEX][6] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:517) ~[elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:848) ~[elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:176) ~[elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:577) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:554) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:231) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:553) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:539) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:503) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:428) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:56) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262) [elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225) [elasticsearch-7.17.18.jar:7.17.18]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [INDEX][6]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [417105173ms]
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:843) ~[elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:743) ~[elasticsearch-7.17.18.jar:7.17.18]
at org.elasticsearch.index.IndexService.createShard(IndexService.java:420) ~[elasticsearch-7.17.18.jar:7.17.18]
... 17 more
The interesting part is the last message, stating that the lock is already held for [closing shard] with an age of 417105173 ms, i.e. roughly 4.8 days - so it seems that a shard close started about 4.8 days earlier and never released the lock.
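For context, the data movement comes from tag-based allocation filtering on the indices. A minimal sketch of what we apply (the attribute name node_tag, the tag value and the index name are placeholders, not our real settings):

curl -X PUT "localhost:9200/INDEX/_settings" -H 'Content-Type: application/json' -d '{ "index.routing.allocation.require.node_tag": "warm" }'

Changing the required tag on an index relocates its shards, which is when shards are closed on the old nodes and created on the new ones.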
Checking the disks of these nodes, they still contain the data of the old shard, so it is not removed properly. As this has happened on several nodes at once, we risk running out of disk space in the cluster.
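To illustrate, this is roughly how we spot the leftover shard data and the remaining disk headroom (a sketch; it assumes the default single data path /var/lib/elasticsearch, actual paths and index UUIDs differ):

# leftover shard directories on an affected data node
du -sh /var/lib/elasticsearch/nodes/0/indices/*/*

# per-node shard count and disk usage as reported by the cluster
curl -s "localhost:9200/_cat/allocation?v"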
There is one other log message, logged before such an exception, that might help:
[2024-06-06T03:23:19,868][INFO ][o.e.i.s.IndexShard ] [NODE] [INDEX][5] primary-replica resync completed with 0 operations
There is no other notable logging during or before the first occurrence of that issue on the data node. At the first occurrence, the reported lock age still looks reasonable:
obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [5214ms]
Looking at the master node during the first occurrence of the error, there are no specific error messages, apart from some notices that a node was running low on disk space and shard allocation was therefore prevented.
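If the output of the allocation explain API for one of the affected shards would help, we can share it; roughly (index name and shard number as in the trace above, primary true/false adjusted as needed):

curl -s -X GET "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d '{ "index": "INDEX", "shard": 6, "primary": false }'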
Is there anything else we can provide to dive deeper into this and help with debugging?
Thanks and have a nice day!
--Alex