NodeEnvironment.assertEnvIsLocked threw java.io.IOException: The device is not ready

Elasticsearch is deployed on an Azure VMSS (Windows VMs). It throws java.io.IOException: "The device is not ready" on some VMs when creating shards, while working fine on other VMs at the same time.

Here is what the exception looks like:

TraceLevel="WARN" ComponentName="default" Message="[2023-11-27T21:24:34,775][WARN ][org.elasticsearch.env.NodeEnvironment] lock assertion failed
java.io.IOException: The device is not ready
	at sun.nio.ch.FileDispatcherImpl.size0(Native Method)
	at sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:101)
	at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:310)
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:170)
	at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:941)
	at org.elasticsearch.env.NodeEnvironment.nodePaths(NodeEnvironment.java:766)
	at org.elasticsearch.monitor.fs.FsProbe.stats(FsProbe.java:55)
	at org.elasticsearch.monitor.fs.FsService.stats(FsService.java:60)
	at org.elasticsearch.monitor.fs.FsService.access$200(FsService.java:33)
	at org.elasticsearch.monitor.fs.FsService$FsInfoCache.refresh(FsService.java:78)
	at org.elasticsearch.monitor.fs.FsService$FsInfoCache.refresh(FsService.java:67)
	at org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
	at org.elasticsearch.monitor.fs.FsService.stats(FsService.java:55)
	at org.elasticsearch.node.NodeService.stats(NodeService.java:110)
	at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:77)
	at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:42)
	at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:140)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:262)
	at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:258)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:674)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

All the VMs have identical configurations. When I RDP to the VM, I can see the "node.lock" file in "K:\esdata\nodes\0", where it's supposed to be. The device seems ready to me: I can create other files in that folder and can even move the "node.lock" file to another folder and back.
(K: is a data disk associated with that VM exclusively, not a shared location.)

Restarting the VM solves the problem in most cases, but it occurs again after some time.

What could be the root cause and how should I fix it?

This exception is coming directly from the OS, and is not something Elasticsearch can work around. You'll need to speak to your infra people, or Azure support, to work out why it's happening.

Thank you, David!

I wrote some Java code to simulate the situation. The code does the following:

  1. When it starts, it creates a file "node1.lock" and locks it with FileChannel.open() and tryLock(). The implementation is the same as what Lucene's NativeFSLockFactory does here: https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/store/NativeFSLockFactory.java#L112-L113
  2. After 0-120 seconds (random, to simulate the frequency of shard creation in our case), it will:
    a. Get the file size of node1.lock with FileChannel.size(). This is the same as what Lucene's NativeFSLockFactory does here: https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/store/NativeFSLockFactory.java#L167-L173
    b. Create a file "node2.lock" and lock it with FileChannel.open() and tryLock(). The implementation is the same as what Lucene's NativeFSLockFactory does.
    c. Get the file size of node2.lock with FileChannel.size(), then release the lock and delete node2.lock. Again, the implementation is the same as what Lucene's NativeFSLockFactory does.
  3. Repeat step 2 indefinitely.

In other words, the code creates and keeps one "permanent" file channel/lock and gets its file size every 0-120 seconds. It also creates a transient file channel/lock, gets the file size, and then releases the lock and deletes the file, again every 0-120 seconds.
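Here is roughly what the simulation looks like. This is a minimal sketch, not the exact code I ran; the class name, the test directory, the console logging, and the simplified error handling are just illustrative.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ThreadLocalRandom;

public class LockProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical test directory; the real test ran on the ES data disk (K:).
        Path dir = Paths.get(args.length > 0 ? args[0] : "K:\\locktest");
        Files.createDirectories(dir);

        // Step 1: the "permanent" channel/lock, held for the lifetime of the process,
        // mirroring how node.lock is acquired and kept.
        Path node1 = dir.resolve("node1.lock");
        FileChannel permanentChannel =
                FileChannel.open(node1, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock permanentLock = permanentChannel.tryLock();
        if (permanentLock == null) {
            throw new IllegalStateException("node1.lock is already locked by another process");
        }

        while (true) {
            // Step 2: wait 0-120 seconds to roughly match our shard-creation frequency.
            Thread.sleep(ThreadLocalRandom.current().nextLong(0, 120_000));

            // Step 2a: size() on the long-lived channel; this is the call that failed
            // with "The device is not ready" on one of the VMs.
            try {
                System.out.println("permanent size = " + permanentChannel.size());
            } catch (IOException e) {
                System.err.println("permanent channel failed: " + e);
            }

            // Steps 2b/2c: a transient lock file that is created, sized, unlocked, and
            // deleted on every iteration; this kept working on the same VM.
            Path node2 = dir.resolve("node2.lock");
            try (FileChannel transientChannel =
                         FileChannel.open(node2, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                FileLock transientLock = transientChannel.tryLock();
                if (transientLock != null) {
                    System.out.println("transient size = " + transientChannel.size());
                    transientLock.release();
                }
            } catch (IOException e) {
                System.err.println("transient channel failed: " + e);
            } finally {
                Files.deleteIfExists(node2);
            }
        }
    }
}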

I put the above code on 9 Azure VMs and ran it for 2 days. This morning I finally got a repro of the "java.io.IOException: The device is not ready" issue on one of those 9 VMs. The IOException was thrown by the "permanent" part; the transient file creation and locking kept working on the same VM at the same time.

Is it expected that Elasticsearch keeps the file channel and lock on "node.lock" open for a long time?

Yes, that's expected.
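The node.lock channel and lock are opened when the node starts and held for the node's lifetime, and the stack trace above shows that every node-stats call re-validates them (NodeEnvironment.assertEnvIsLocked → NativeFSLock.ensureValid) by calling size() on that long-held channel. A paraphrased sketch of that shape, not the exact Lucene source (the class and exception names here are just illustrative):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Sketch of the validity check on the long-held node.lock: the channel and lock
// below are the ones opened at node startup, so an OS-level failure of
// channel.size() surfaces as the IOException in the log above.
final class NodeLockSketch {
    private final FileChannel channel; // opened once at node startup, kept open
    private final FileLock lock;       // held for the lifetime of the node

    NodeLockSketch(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    void ensureValid() throws IOException {
        if (!lock.isValid()) {
            throw new IllegalStateException("node.lock was invalidated externally");
        }
        // Validates the underlying file descriptor; this is the call that throws
        // "java.io.IOException: The device is not ready" in the stack trace above.
        channel.size();
    }
}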

