Massive queue in "refresh" thread pool on a single node, causing timeouts


This is a difficult one because it only happens randomly; I can't figure out a way to reproduce it consistently. What happens is that I'll notice a lot of timeouts when trying to index into the cluster, and the root cause appears to be a large number of queued refresh threads.

I'm curious whether (#1) there is a way to determine what is causing the backpressure in the refresh queue, and (#2) there is a way to "reset" it without restarting the node.

Thanks!


For #1 I would start by checking GET _nodes/hot_threads?threads=99999. For #2 the answer is no, Elasticsearch will just keep on processing the tasks until they're done.
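If it helps, the cat thread pool API also reports per-node queue depth and rejections, which confirms which pool is backing up before you dig into the stacks (column names per the `_cat/thread_pool` API):

```
GET _cat/thread_pool/refresh?v&h=node_name,active,queue,rejected,completed
```

A node showing a deep `queue` and a growing `rejected` count is the one worth pointing `hot_threads` at.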

I'm having the exact same issue and, unfortunately, it's happening randomly too. I'd be very interested in seeing your hot_threads output as well.
I'll share mine in case it helps:

    0.0% (0s out of 500ms) cpu usage by thread 'threadDeathWatcher-2-1'
     10/10 snapshots sharing following 4 elements
       java.lang.Thread.sleep(Native Method)
       io.netty.util.ThreadDeathWatcher$Watcher.run(ThreadDeathWatcher.java:152)
       io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
       java.lang.Thread.run(Thread.java:748)
   
    0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
     unique snapshot
   
    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#1]'
     10/10 snapshots sharing following 27 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
       java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
       java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
       java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:251)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
       org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:65)
       org.elasticsearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnReplica(TransportShardRefreshAction.java:38)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:494)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:467)
       org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:147)
       org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationLock(IndexShard.java:1673)
       org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:566)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:451)
       org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:441)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][refresh][T#2]'
     10/10 snapshots sharing following 39 elements
       sun.nio.fs.WindowsNativeDispatcher.DeleteFile0(Native Method)
       sun.nio.fs.WindowsNativeDispatcher.DeleteFile(WindowsNativeDispatcher.java:114)
       sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:249)
       sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
       java.nio.file.Files.delete(Files.java:1126)
       org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:373)
       org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:335)
       org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
       org.apache.lucene.store.FilterDirectory.deleteFile(FilterDirectory.java:62)
       org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:700)
       org.elasticsearch.index.store.Store$StoreDirectory.deleteFile(Store.java:705)
       org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
       org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:723)
       org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:717)
       org.apache.lucene.index.IndexFileDeleter.deleteNewFiles(IndexFileDeleter.java:693)
       org.apache.lucene.index.IndexWriter.deleteNewFiles(IndexWriter.java:4965)
       org.apache.lucene.index.DocumentsWriter$DeleteNewFilesEvent.process(DocumentsWriter.java:771)
       org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5043)
       org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5034)
       org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:477)
       org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:291)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:266)
       org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:256)
       org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104)
       org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
       org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:910)
       org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:632)
       org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:690)
       org.elasticsearch.index.IndexService.access$400(IndexService.java:92)
       org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:832)
       org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:743)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)
   
    0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[PRODUCTION_CRITQUE_2][flush][T#4334]'
     10/10 snapshots sharing following 16 elements
       org.apache.lucene.index.IndexWriter.setLiveCommitData(IndexWriter.java:3116)
       org.elasticsearch.index.engine.InternalEngine.commitIndexWriter(InternalEngine.java:1562)
       org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1063)
       org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:780)
       org.elasticsearch.indices.flush.SyncedFlushService.performPreSyncedFlush(SyncedFlushService.java:414)
       org.elasticsearch.indices.flush.SyncedFlushService.access$1000(SyncedFlushService.java:70)
       org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:696)
       org.elasticsearch.indices.flush.SyncedFlushService$PreSyncedFlushTransportHandler.messageReceived(SyncedFlushService.java:692)
       org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

On our side, it seems that a thread has been blocked for a while on a file deletion on the Lucene side. We still haven't found what is blocking the deletion of this file (we tried excluding the antivirus, stopping backups, etc.), so we just keep restarting the node when this happens...

Elasticsearch (well, Lucene) assumes that deleting a file should be fast, so it's OK to block a thread while it happens. Unfortunately I also have no idea why it is taking so much longer than expected in your case, but it's definitely a problem in your environment.

I can't exclude a problem in our environment, of course.
We tried shutting down ALL services on the server except Elasticsearch, disabling the antivirus, and working locally, and these locks were still there.
I'm not a Java expert, but don't you think it's possible that some threads are blocked acquiring a semaphore or lock before deleting a Lucene file, or something like that?
Maybe it's an issue in our Elastic configuration; I don't really know at this point, to be honest.

Both @Rayzbam and I are running our ES clusters on Windows Server 2019, so this could be something unique to the way Java & Windows interact.

Not to say that there's no locking going on, but we definitely see that the thread is not waiting for a monitor or anything, it's stuck deleting a file for far too long:

sun.nio.fs.WindowsNativeDispatcher.DeleteFile0 is a native method whose implementation in the JDK is this:
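(Paraphrased from the OpenJDK `WindowsNativeDispatcher.c` sources; the exact error-handling details may differ slightly.)

```c
/* Convert the jlong back to a wide-string pointer and call the
 * Win32 API directly; throw a Java exception if the call fails. */
JNIEXPORT void JNICALL
Java_sun_nio_fs_WindowsNativeDispatcher_DeleteFile0(JNIEnv* env, jclass this,
                                                    jlong address)
{
    LPCWSTR lpFileName = jlong_to_ptr(address);
    if (DeleteFileW(lpFileName) == 0)
        throwWindowsException(env, GetLastError());
}
```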

Pretty much the only thing it could be doing is waiting for DeleteFileW to return.

I don't know a huge amount about debugging this kind of thing in Windows (vs say Linux) but I believe that Process Monitor should show these delete calls taking a long time too.

Unfortunately, Process Monitor is useful when you already know which file is locked, and the Elasticsearch hot_threads output doesn't give us any clue about that.
I'm not very familiar with Process Monitor, to be honest, but the last time I had this issue I tried it and didn't find any file locked by java.exe.
I'll report back the next time it happens (which should be soon enough for this topic to still be open).

I believe you can set up the filters to report all DeleteFileW calls; there shouldn't be too many of them. To be clear, I don't think this is about a file being locked: the file it's deleting here won't have been locked, at least not by Lucene/Elasticsearch.

If it's genuinely completely stuck on deleting a single file, rather than just going really slowly through the list, then a heap dump of Elasticsearch would tell you the file name.
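For reference, a rough sketch of capturing that dump with standard JDK tooling (the pid and output path below are placeholders):

```
jcmd -l                                  # list running JVMs; note the Elasticsearch pid
jcmd <es-pid> GC.heap_dump C:\dumps\es-heap.hprof
```

Opening the resulting .hprof in a heap analyzer such as Eclipse MAT and inspecting the Lucene `IndexFileDeleter` instances (visible in the stack traces above) should reveal the name of the file it is trying to delete.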

Thanks @DavidTurner.

I'll try this the next time this issue occurs. I'll let you know if I find anything useful.

I just had this issue occur. I have 1649 queued threads in the refresh pool on my elastic2 node.

My call to GET /_nodes/elastic2/hot_threads looks like this:

::: {elastic2}{hRztlqGeRTGFFF-0_eQHpQ}{9kjCk6_QRAuR1V1-Ls5Nlw}{10.69.5.21}{10.69.5.21:9300}{dimrt}{xpack.installed=true, transform.node=true}
   Hot threads at 2021-08-17T15:24:05.048Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   34.4% (171.8ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][management][T#4]'
     3/10 snapshots sharing following 14 elements
       app//org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1048)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
       app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:110)
       app//org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:48)
       app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:427)
       app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:405)
       app//org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:392)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:263)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
     7/10 snapshots sharing following 9 elements
       java.base@14.0.1/jdk.internal.misc.Unsafe.park(Native Method)
       java.base@14.0.1/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:252)
       java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:740)
       java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:684)
       java.base@14.0.1/java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1374)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1055)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1116)
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

Then I call it again, and it looks like this:

::: {elastic2}{hRztlqGeRTGFFF-0_eQHpQ}{9kjCk6_QRAuR1V1-Ls5Nlw}{10.69.5.21}{10.69.5.21:9300}{dimrt}{xpack.installed=true, transform.node=true}
   Hot threads at 2021-08-17T15:22:11.364Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][write][T#1]'
     9/10 snapshots sharing following 2 elements
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)
   
    6.3% (31.2ms out of 500ms) cpu usage by thread 'elasticsearch[elastic2][write][T#4]'
     10/10 snapshots sharing following 2 elements
       java.base@14.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@14.0.1/java.lang.Thread.run(Thread.java:832)

Every time I call it, it has differing responses, sometimes even without any stacktraces, and sometimes with many.

This makes me think that there isn't a single stuck thread causing the massive refresh queue.

Also, here is my call to GET /_nodes/elastic2/stats

I don't see a smoking gun, but I'm also not sure what to look for.

Maybe try GET _nodes/elastic2/hot_threads?threads=99999&ignore_idle_threads=false. If there are items in the refresh queue then there will be no idle threads in the pool, but they may not be among the top 3 hottest ones.

@DavidTurner Do you think our "DeleteFile" issue could be a swapping issue? We didn't disable swapping on our servers, and as you said, the only stack trace we have is the "inner" Java deletion, which just resolves the pointer for the file passed to the method.

Thanks for your help.

I wouldn't think so, but this is beyond my knowledge of Windows. Maybe. Is there much swapping happening? No idea how you'd find out, sorry.

By "find the pointer" do you mean you think it's stuck in jlong_to_ptr? This is just a basic pointer-casting macro, it won't be stuck there:

This happened again to me.

I ran the command below (as requested by @DavidTurner)

The results are here: https://pastebin.com/raw/7PJmL8SF

From what I see, there are threads stuck on
java.base@14.0.1/sun.nio.fs.WindowsNativeDispatcher.DeleteFile0(Native Method)

Is there any further info that I can get to help troubleshoot this annoying problem?

As mentioned above, DeleteFile0 is a very simple function that just calls into the Windows API function DeleteFileW, so the question is why DeleteFileW is taking such a long time to return. Whatever it's doing, it's not running any Elasticsearch code at this point.

Unfortunately I have no idea how to even start investigating system-level questions like that more deeply on Windows.

@Rayzbam, for problems deleting files on Windows, antivirus and the Indexing Service are common culprits, though the latter is less likely on server editions of Windows. I've found the Process Explorer tool quite helpful in identifying locks on files, since you can search by directory name or filename across all processes. I would do that on the data directory of the Elasticsearch node that is stuck deleting the file, to see whether any other processes are accessing any of the files in that directory.
