This is a difficult one because it only happens randomly; I can't figure out a way to reproduce it consistently. What happens is I'll notice a lot of timeouts when trying to index into the cluster, and what seems to be the root cause is a ton of refresh tasks queued up.
I'm curious whether (#1) there is a way to determine what is causing the backpressure in the refresh queue, and (#2) whether there is a way to "reset" it without restarting the node.
For #1 I would start by checking GET _nodes/hot_threads?threads=99999. For #2 the answer is no: Elasticsearch will just keep on processing the tasks until they're done.
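The backlog itself is also easy to quantify with the cat thread pool API; something like the following (the columns listed are standard _cat/thread_pool columns) shows, per node, how many refresh tasks are active, queued, and rejected:

GET _cat/thread_pool/refresh?v&h=node_name,name,active,queue,rejected,completed

If the queue column keeps growing while active stays pinned at the pool size, that matches the indexing timeouts you're describing.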
I'm having the exact same issue and, unfortunately, it's happening randomly too. I'd be very interested in checking your hot_threads as well.
I'll share mine in case it helps you:
0.0% (0s out of 500ms) cpu usage by thread 'threadDeathWatcher-2-1'
10/10 snapshots sharing following 4 elements
java.lang.Thread.sleep(Native Method)
io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:152)
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
unique snapshot
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#1]'
10/10 snapshots sharing following 27 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:251)
org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:65)
org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:38)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467)
org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147)
org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451)
org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#2]'
10/10 snapshots sharing following 39 elements
sun.nio.fs.WindowsNativeDispatcher.DeleteFile0(Native Method)
sun.nio.fs.WindowsNativeDispatcher.DeleteFile(WindowsNativeDispatcher.java:114)
sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:249)
sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
java.nio.file.Files.delete(Files.java:1126)
org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:373)
org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:335)
org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:700)
org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:705)
org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:723)
org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:717)
org.apache.lucene.index.IndexFileDeleter.deleteNewFiles(IndexFileDeleter.java:693)
org.apache.lucene.index.IndexWriter.deleteNewFiles(IndexWriter.java:4965)
org.apache.lucene.index.DocumentsWriter$DeleteNewFilesEvent.process(DocumentsWriter.java:771)
org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5043)
org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5034)
org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:477)
org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:266)
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:256)
org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104)
org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:690)
org.elasticsearch.index.IndexService.access$400(IndexService.java:92)
org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:832)
org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][flush][T#4334]'
10/10 snapshots sharing following 16 elements
org.apache.lucene.index.IndexWriter.setLiveCommitData(IndexWriter.java:3116)
org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:1562)
org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1063)
org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:780)
org.elasticsearch.indices.flush.SyncedFlushService.performPreSyncedFlush(SyncedFlushService.java:414)
org.elasticsearch.indices.flush.SyncedFlushService.access$1000(SyncedFlushService.java:70)
org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:696)
org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:692)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
On our side, it seems that a thread has been blocked for a while on a file deletion on the Lucene side. We still haven't found what is blocking the deletion of this file (we tried excluding the antivirus, stopping backups, etc.), so we just keep restarting the node when this happens...
Elasticsearch (well, Lucene) assumes that deleting a file should be fast, so it's OK to block a thread while it happens. Unfortunately I also have no idea why it is taking so much longer than expected in your case, but it's definitely a problem in your environment.
I can't exclude a problem in our environment, of course.
We tried shutting off ALL services on the server except the Elasticsearch one, shutting off the antivirus, and working locally, and these locks were still there...
I mean, I'm not a Java expert, but don't you think it's possible that some threads could be blocked acquiring a semaphore or lock for deleting a Lucene file, or something like that?
Maybe it's an issue in our Elasticsearch configuration; I honestly don't know at this point.
Not to say that there's no locking going on, but we can definitely see that the thread is not waiting for a monitor or anything; it's stuck deleting a file for far too long.
sun.nio.fs.WindowsNativeDispatcher.DeleteFile0 is a native method whose implementation in the JDK is roughly the following (reconstructed from memory of the OpenJDK sources, so the exact details may differ):
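/* WindowsNativeDispatcher.c -- approximate reconstruction, not copied verbatim */
JNIEXPORT void JNICALL
Java_sun_nio_fs_WindowsNativeDispatcher_DeleteFile0(JNIEnv* env, jclass this,
    jlong address)
{
    LPCWSTR lpFileName = jlong_to_ptr(address);

    /* delegate straight to the Windows API; surface any failure as a Java exception */
    if (DeleteFileW(lpFileName) == 0) {
        throwWindowsException(env, GetLastError());
    }
}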
Pretty much the only thing it could be doing is waiting for DeleteFileW to return.
I don't know a huge amount about debugging this kind of thing on Windows (vs., say, Linux), but I believe that Process Monitor should show these delete calls taking a long time too.
Unfortunately, Process Monitor is mostly useful when you already know which file is locked. The Elasticsearch hot_threads output doesn't actually give us any clue about that.
I'm not very familiar with Process Monitor, to be honest, but the last time I had this issue I tried it and didn't find any file locked by java.exe.
I'll give you another capture when it happens again (which should be soon enough for this topic to still be open).
I believe you can set up the filters to report just the DeleteFileW calls; there shouldn't be too many of them. To be clear, I don't think this is about a file being locked: the file being deleted here won't have been locked, at least not by Lucene/Elasticsearch.
If it's genuinely completely stuck on deleting a single file, rather than just going really slowly through the list, then a heap dump of Elasticsearch would tell you the file name.
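For reference, one way to capture such a dump is with the JDK's jmap tool against the Elasticsearch process (the output path and PID below are placeholders):

jmap -dump:format=b,file=C:\tmp\es-heap.hprof <elasticsearch-pid>

Opening the resulting .hprof in a heap analyzer such as Eclipse MAT and inspecting the stuck refresh thread's stack-frame locals should let you recover the name of the file being deleted.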
My call to GET /_nodes/elastic2/hot_threads looks like this:
::: {elastic2}{hRztlqGeRTGFFF-0_eQHpQ}{9kjCk6_QRAuR1V1-Ls5Nlw}{10.69.5.21}{10.69.5.21:9300}{dimrt}{xpack.installed=true, transform.node=true}
Hot threads at 2021-08-17T15:24:05.048Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
34.4% (171.8ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][management][T#4]'
3/10 snapshots sharing following 14 elements
app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1048)
app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:110)
app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:48)
app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:427)
app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:405)
app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:392)
app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:263)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
7/10 snapshots sharing following 9 elements
java.base@14.0.1/jdk.internal.misc.Unsafe.park(Native Method)
java.base@14.0.1/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:740)
java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:684)
java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1374)
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1055)
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
Then I call it again, and it looks like this:
::: {elastic2}{hRztlqGeRTGFFF-0_eQHpQ}{9kjCk6_QRAuR1V1-Ls5Nlw}{10.69.5.21}{10.69.5.21:9300}{dimrt}{xpack.installed=true, transform.node=true}
Hot threads at 2021-08-17T15:22:11.364Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][write][T#1]'
9/10 snapshots sharing following 2 elements
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][write][T#4]'
10/10 snapshots sharing following 2 elements
java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
Every time I call it I get a different response, sometimes with no stack traces at all and sometimes with many.
This makes me think that there isn't a single stuck thread causing the massive refresh queue.
Maybe try GET _nodes/elastic2/hot_threads?threads=99999&ignore_idle_threads=false. If there are items in the refresh queue then there will be no idle threads in the pool, but they may not be among the top 3 hottest ones.
@DavidTurner Do you think our "DeleteFile" issue could be a swapping issue? We didn't disable swapping on our servers, and as you said, the only stack trace we have is the "inner" Java deletion, which just resolves the pointer for the file passed to the method and deletes it.
As mentioned above, DeleteFile0 is a very simple function that just calls into the Windows API function DeleteFileW, so the question is why DeleteFileW is taking such a long time to return. Whatever it's doing, it's not running any Elasticsearch code at this point.
Unfortunately I have no idea how to even start investigating system-level questions like that more deeply on Windows.
@Rayzbam, for problems deleting files on Windows, antivirus and the Index Service are common culprits, though the latter is less likely on server editions of Windows. I've found the Process Explorer tool to be quite helpful in identifying locks on files, since you can search by directory name or filename across all processes. I would do that on the data directory of the Elasticsearch node that is stuck deleting the file, to see whether any other process is accessing any of the files in that directory.
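If a command-line check is easier on that box, the Sysinternals handle utility performs the same kind of search as Process Explorer's handle finder; for example (the path is only an illustration of where a data directory might live):

handle.exe "D:\elasticsearch\data\nodes"

Any process other than Elasticsearch's own java.exe holding a handle under that path would be a prime suspect.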