NodeEnvironment.assertEnvIsLocked threw java.io.IOException: The device is not ready

Elasticsearch is deployed on an Azure VMSS (Windows VMs). It throws java.io.IOException: "The device is not ready" on some VMs when creating shards, while working fine on other VMs at the same time.

Here is what the exception looks like:

TraceLevel="WARN" ComponentName="default" Message="[2023-11-27T21:24:34,775][WARN ][org.elasticsearch.env.NodeEnvironment] lock assertion failed
java.io.IOException: The device is not ready
	at sun.nio.ch.FileDispatcherImpl.size0(Native Method)
	at sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:101)
	at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:310)
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:170)
	at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:941)
	at org.elasticsearch.env.NodeEnvironment.nodePaths(NodeEnvironment.java:766)
	at org.elasticsearch.monitor.fs.FsProbe.stats(FsProbe.java:55)
	at org.elasticsearch.monitor.fs.FsService.stats(FsService.java:60)
	at org.elasticsearch.monitor.fs.FsService.access$200(FsService.java:33)
	at org.elasticsearch.monitor.fs.FsService$FsInfoCache.refresh(FsService.java:78)
	at org.elasticsearch.monitor.fs.FsService$FsInfoCache.refresh(FsService.java:67)
	at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
	at org.elasticsearch.monitor.fs.FsService.stats(FsService.java:55)
	at org.elasticsearch.node.NodeService.stats(NodeService.java:110)
	at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:77)
	at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:42)
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

All the VMs have identical configurations. When I RDP to the VM, I can see the "node.lock" file in "K:\esdata\nodes\0", where it's supposed to be. The device seems ready to me: I can create other files in that folder and can even move the "node.lock" file to another folder and back.
(K: is a data disk associated with that VM exclusively, not a shared location.)

Restarting the VM solves the problem in most cases, but it occurs again after some time.

What could be the root cause and how should I fix it?

This exception is coming directly from the OS, and is not something Elasticsearch can work around. You'll need to speak to your infra people, or Azure support, to work out why it's happening.

Thank you, David!

I wrote some Java code to simulate the situation. The code does the following:

  1. When it starts, it creates a file "node1.lock" and locks it with FileChannel.open() and tryLock(). The implementation is the same as what Lucene's NativeFSLockFactory does here: https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/store/NativeFSLockFactory.java#L112-L113
  2. After 0-120 seconds (random, to simulate the frequency of shard creation in our case), it will:
    a. Get the file size of node1.lock with FileChannel.size(). This is the same as what Lucene's NativeFSLockFactory does here: https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/store/NativeFSLockFactory.java#L167-L173
    b. Create a file "node2.lock" and lock it with FileChannel.open() and tryLock(). The implementation is the same as what Lucene's NativeFSLockFactory does.
    c. Get the file size of node2.lock with FileChannel.size(), then release the lock and delete node2.lock. Again, the implementation is the same as what Lucene's NativeFSLockFactory does.
  3. Repeat step 2 indefinitely.

In other words, the code creates and keeps one "permanent" file channel/lock and gets its file size every 0-120 seconds. It also creates a transient file channel/lock, gets the file size, and then releases the lock and deletes the file, again every 0-120 seconds.
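Here is roughly what the simulation looks like. This is a minimal sketch, not the exact code I ran; the class name, the test directory, the console logging, and the simplified error handling are just illustrative.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ThreadLocalRandom;

public class LockProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical test directory; the real test ran on the ES data disk (K:).
        Path dir = Paths.get(args.length > 0 ? args[0] : "K:\\locktest");
        Files.createDirectories(dir);

        // Step 1: the "permanent" channel/lock, held for the lifetime of the process,
        // mirroring how node.lock is acquired and kept.
        Path node1 = dir.resolve("node1.lock");
        FileChannel permanentChannel =
                FileChannel.open(node1, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock permanentLock = permanentChannel.tryLock();
        if (permanentLock == null) {
            throw new IllegalStateException("node1.lock is already locked by another process");
        }

        while (true) {
            // Step 2: wait 0-120 seconds to roughly match our shard-creation frequency.
            Thread.sleep(ThreadLocalRandom.current().nextLong(0, 120_000));

            // Step 2a: size() on the long-lived channel; this is the call that failed
            // with "The device is not ready" on one of the VMs.
            try {
                System.out.println("permanent size = " + permanentChannel.size());
            } catch (IOException e) {
                System.err.println("permanent channel failed: " + e);
            }

            // Steps 2b/2c: a transient lock file that is created, sized, unlocked, and
            // deleted on every iteration; this kept working on the same VM.
            Path node2 = dir.resolve("node2.lock");
            try (FileChannel transientChannel =
                         FileChannel.open(node2, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                FileLock transientLock = transientChannel.tryLock();
                if (transientLock != null) {
                    System.out.println("transient size = " + transientChannel.size());
                    transientLock.release();
                }
            } catch (IOException e) {
                System.err.println("transient channel failed: " + e);
            } finally {
                Files.deleteIfExists(node2);
            }
        }
    }
}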

I put the above code on 9 Azure VMs and ran it for 2 days. This morning I finally got a repro of the "java.io.IOException: The device is not ready" issue on one of those 9 VMs. The IOException was thrown by the "permanent" part; the transient file creation and locking kept working on the same VM at the same time.

Is it expected that Elasticsearch keeps the file channel and lock on "node.lock" open for a long time?

Yes, that's expected.
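The node.lock channel and lock are opened when the node starts and held for the node's lifetime, and the stack trace above shows that every node-stats call re-validates them (NodeEnvironment.assertEnvIsLocked → NativeFSLock.ensureValid) by calling size() on that long-held channel. A paraphrased sketch of that shape, not the exact Lucene source (the class and exception names here are just illustrative):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Sketch of the validity check on the long-held node.lock: the channel and lock
// below are the ones opened at node startup, so an OS-level failure of
// channel.size() surfaces as the IOException in the log above.
final class NodeLockSketch {
    private final FileChannel channel; // opened once at node startup, kept open
    private final FileLock lock;       // held for the lifetime of the node

    NodeLockSketch(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    void ensureValid() throws IOException {
        if (!lock.isValid()) {
            throw new IllegalStateException("node.lock was invalidated externally");
        }
        // Validates the underlying file descriptor; this is the call that throws
        // "java.io.IOException: The device is not ready" in the stack trace above.
        channel.size();
    }
}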

