Massive queue in "refresh" thread pool on a single node, causing timeouts


This is a difficult one because it only happens randomly; I can't figure out a way to reproduce it consistently. What happens is that I'll notice a lot of timeouts when trying to index into the cluster, and the root cause appears to be a large number of queued refresh threads.

I'm curious whether (#1) there is a way to determine what is causing the backpressure in the refresh queue, and (#2) there is a way to "reset" it without restarting the node.

Thanks!


For #1 I would start by checking GET _nodes/hot_threads?threads=99999. For #2 the answer is no, Elasticsearch will just keep on processing the tasks until they're done.
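If it helps, the cat thread pool API also reports per-node queue depth and rejections, which confirms which pool is backing up before you dig into the stacks (column names per the `_cat/thread_pool` API):

```
GET _cat/thread_pool/refresh?v&h=node_name,active,queue,rejected,completed
```

A node showing a deep `queue` and a growing `rejected` count is the one worth pointing `hot_threads` at.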

I'm having the exact same issue and, unfortunately, it's happening randomly too. I'd be very interested in seeing your hot_threads output as well.
I'll share mine in case it helps:

    0.0% (0s out of 500ms) cpu usage by thread 'threadDeathWatcher-2-1'
     10/10 snapshots sharing following 4 elements
       java.lang.Thread.sleep(Native Method)
       io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:152)
       io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
       java.lang.Thread.run(Thread.java:748)
   
    0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
   
    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#1]'
     10/10 snapshots sharing following 27 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
       java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
       java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:251)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
       org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:65)
       org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:38)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467)
       org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147)
       org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451)
       org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#2]'
     10/10 snapshots sharing following 39 elements
       sun.nio.fs.WindowsNativeDispatcher.DeleteFile0(Native Method)
       sun.nio.fs.WindowsNativeDispatcher.DeleteFile(WindowsNativeDispatcher.java:114)
       sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:249)
       sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
       java.nio.file.Files.delete(Files.java:1126)
       org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:373)
       org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:335)
       org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
       org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
       org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:700)
       org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:705)
       org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
       org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:723)
       org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:717)
       org.apache.lucene.index.IndexFileDeleter.deleteNewFiles(IndexFileDeleter.java:693)
       org.apache.lucene.index.IndexWriter.deleteNewFiles(IndexWriter.java:4965)
       org.apache.lucene.index.DocumentsWriter$DeleteNewFilesEvent.process(DocumentsWriter.java:771)
       org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5043)
       org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5034)
       org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:477)
       org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:266)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:256)
       org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104)
       org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
       org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:690)
       org.elasticsearch.index.IndexService.access$400(IndexService.java:92)
       org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:832)
       org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][flush][T#4334]'
     10/10 snapshots sharing following 16 elements
       org.apache.lucene.index.IndexWriter.setLiveCommitData(IndexWriter.java:3116)
       org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:1562)
       org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1063)
       org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:780)
       org.elasticsearch.indices.flush.SyncedFlushService.performPreSyncedFlush(SyncedFlushService.java:414)
       org.elasticsearch.indices.flush.SyncedFlushService.access$1000(SyncedFlushService.java:70)
       org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:696)
       org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:692)
       org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

On our side, it seems that a thread has been blocked for a while on a file deletion on the Lucene side. We still haven't found what is blocking the deletion of this file (we tried excluding the antivirus, stopping backups, etc.), so we just keep restarting the node when this happens...

Elasticsearch (well, Lucene) assumes that deleting a file should be fast, so it's OK to block a thread while it happens. Unfortunately I also have no idea why it is taking so much longer than expected in your case, but it's definitely a problem in your environment.

I can't exclude a problem in our environment, of course.
We tried shutting down ALL services on the server except Elasticsearch, disabling the antivirus, and working locally, and these locks were still there.
I'm not a Java expert, but don't you think it's possible that some threads are blocked acquiring a semaphore or lock before deleting a Lucene file, or something like that?
Maybe it's an issue in our Elastic configuration; I don't really know at this point, to be honest.

Both @Rayzbam and I are running our ES clusters on Windows Server 2019, so this could be something unique to the way Java & Windows interact.

Not to say that there's no locking going on, but we definitely see that the thread is not waiting for a monitor or anything, it's stuck deleting a file for far too long:

sun.nio.fs.WindowsNativeDispatcher.DeleteFile0 is a native method whose implementation in the JDK is this:
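(Paraphrased from the OpenJDK `WindowsNativeDispatcher.c` sources; the exact error-handling details may differ slightly.)

```c
/* Convert the jlong back to a wide-string pointer and call the
 * Win32 API directly; throw a Java exception if the call fails. */
JNIEXPORT void JNICALL
Java_sun_nio_fs_WindowsNativeDispatcher_DeleteFile0(JNIEnv* env, jclass this,
                                                    jlong address)
{
    LPCWSTR lpFileName = jlong_to_ptr(address);
    if (DeleteFileW(lpFileName) == 0)
        throwWindowsException(env, GetLastError());
}
```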

Pretty much the only thing it could be doing is waiting for DeleteFileW to return.

I don't know a huge amount about debugging this kind of thing in Windows (vs say Linux) but I believe that Process Monitor should show these delete calls taking a long time too.

Unfortunately, Process Monitor is useful when you already know which file is locked, and the Elasticsearch hot_threads output doesn't give us any clue about that.
I'm not very familiar with Process Monitor, to be honest, but the last time I had this issue I tried it and didn't find any file locked by java.exe.
I'll report back the next time it happens (which should be soon enough for this topic to still be open).

I believe you can set up the filters to report all DeleteFileW calls; there shouldn't be too many of them. To be clear, I don't think this is about a file being locked: the file it's deleting here won't have been locked, at least not by Lucene/Elasticsearch.

If it's genuinely completely stuck on deleting a single file, rather than just going really slowly through the list, then a heap dump of Elasticsearch would tell you the file name.
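For reference, a rough sketch of capturing that dump with standard JDK tooling (the pid and output path below are placeholders):

```
jcmd -l                                  # list running JVMs; note the Elasticsearch pid
jcmd <es-pid> GC.heap_dump C:\dumps\es-heap.hprof
```

Opening the resulting .hprof in a heap analyzer such as Eclipse MAT and inspecting the Lucene `IndexFileDeleter` instances (visible in the stack traces above) should reveal the name of the file it is trying to delete.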

Thanks @DavidTurner.

I'll try this the next time this issue occurs. I'll let you know if I find anything useful.

I just had this issue occur. I have 1649 queued threads in the refresh pool on my elastic2 node.

My call to GET /_nodes/elastic2/hot_threads looks like this:

::: {elastic2}{hRztlqGeRTGFFF-0_eQHpQ}{9kjCk6_QRAuR1V1-Ls5Nlw}{10.69.5.21}{10.69.5.21:9300}{dimrt}{xpack.installed=true, transform.node=true}
   Hot threads at 2021-08-17T15:24:05.048Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   34.4% (171.8ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][management][T#4]'
     3/10 snapshots sharing following 14 elements
       app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1048)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
       app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:110)
       app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:48)
       app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:427)
       app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:405)
       app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:392)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:263)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
     7/10 snapshots sharing following 9 elements
       java.base@14.0.1/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@14.0.1/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
       java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:740)
       java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:684)
       java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1374)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1055)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

Then I call it again, and it looks like this:

::: {elastic2}{hRztlqGeRTGFFF-0_eQHpQ}{9kjCk6_QRAuR1V1-Ls5Nlw}{10.69.5.21}{10.69.5.21:9300}{dimrt}{xpack.installed=true, transform.node=true}
   Hot threads at 2021-08-17T15:22:11.364Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][write][T#1]'
     9/10 snapshots sharing following 2 elements
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
   
    6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][write][T#4]'
     10/10 snapshots sharing following 2 elements
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

Every time I call it, it has differing responses, sometimes even without any stacktraces, and sometimes with many.

This makes me think that there isn't a single stuck thread causing the massive refresh queue.

Also, here is my call to GET /_nodes/elastic2/stats

I don't see a smoking gun, but I'm also not sure what to look for.

Maybe try GET _nodes/elastic2/hot_threads?threads=99999&ignore_idle_threads=false. If there are items in the refresh queue then there will be no idle threads in the pool, but they may not be among the top 3 hottest ones.

@DavidTurner Do you think our "DeleteFile" issue could be a swapping issue? We didn't disable swapping on our servers, and as you said, the only stack trace we have is the "inner" Java deletion, which just resolves the pointer for the file passed to the method.

Thanks for your help.

I wouldn't think so, but this is beyond my knowledge of Windows. Maybe. Is there much swapping happening? No idea how you'd find out, sorry.

By "find the pointer" do you mean you think it's stuck in jlong_to_ptr? This is just a basic pointer-casting macro, it won't be stuck there:

This happened again to me.

I ran the command below (as requested by @DavidTurner)

The results are here: https://pastebin.com/raw/7PJmL8SF

From what I see, there are threads stuck on
java.base@14.0.1/sun.nio.fs.WindowsNativeDispatcher.DeleteFile0(Native Method)

Is there any further info that I can get to help troubleshoot this annoying problem?

As mentioned above, DeleteFile0 is a very simple function that just calls into the Windows API function DeleteFileW, so the question is why DeleteFileW is taking such a long time to return. Whatever it's doing, it's not running any Elasticsearch code at this point.

Unfortunately I have no idea how to even start investigating system-level questions like that more deeply on Windows.

@Rayzbam, for problems deleting files on Windows, antivirus and the Indexing Service are common culprits, though the latter is less likely on server editions of Windows. I've found the Process Explorer tool quite helpful in identifying locks on files, since you can search by directory name or filename across all processes. I would do that on the data directory of the Elasticsearch node that is stuck deleting the file, to see whether any other processes are accessing any of the files in that directory.
