Hi Elasticsearch community, does Elasticsearch cleanly close the per-index threads after deleting an index?
ES Version: 7.6.1
I have encountered an issue with my Elasticsearch setup where the thread count slowly grows over time; while it grows, I notice end-to-end search latency increasing.
Below is a graph showing the operating system's count of total threads. The "steps up" in the step function occur when I create a new index. Every day, I create 5 new indices with a total of 504 shards (~165 primaries, 339 replicas) across 39 nodes plus 3 masters, which works out to 12-13 new shards created per node each day. I then delete 5 old indices.
Looking at the threads that stay open each day, I see a total of ~255 new threads per day (per new index created) that are not closed afterwards.
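(For reference, the thread counts in the graph can be sampled straight from the kernel; this is a minimal sketch, Linux-only, where the Elasticsearch PID would be passed in — "self" below is just a placeholder for demonstration.)

```python
def os_thread_count(pid="self"):
    """Read a process's OS-level thread count from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("Threads: line not found in /proc status")

# In my case I'd point this at the Elasticsearch PID, e.g. os_thread_count("12345"),
# and sample it on a schedule to correlate the steps with index creation/deletion.
print(os_thread_count())
```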
When I run:
jdk1.8/bin/jcmd [ES_process_id] Thread.print
Here is an excerpt of some of the threads that are left hanging; in total there are over 13,000 threads.
"elasticsearch[[timer]]" #89201 daemon prio=5 os_prio=0 tid=0x00007f3b3146d800 nid=0xaa4 waiting on condition [0x00007f3466463000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.elasticsearch.threadpool.ThreadPool$CachedTimeThread.run(ThreadPool.java:592)
"elasticsearch[scheduler][T#1]" #89200 daemon prio=5 os_prio=0 tid=0x00007f3b3146c000 nid=0xaa3 waiting on condition [0x00007f346cb61000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e5d063388> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
"elasticsearch[[timer]]" #89199 daemon prio=5 os_prio=0 tid=0x00007f3b3146a800 nid=0xaa2 waiting on condition [0x00007f348580d000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.elasticsearch.threadpool.ThreadPool$CachedTimeThread.run(ThreadPool.java:592)
"elasticsearch[scheduler][T#1]" #89198 daemon prio=5 os_prio=0 tid=0x00007f3b31468800 nid=0xaa1 waiting on condition [0x00007f3584ca0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3ed2cc74d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
...
"elasticsearch[refresh][T#1]" #89175 daemon prio=5 os_prio=0 tid=0x00007f47e0479000 nid=0x74b waiting on condition [0x00007f346f078000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3f43436c50> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
"elasticsearch[refresh][T#1]" #89174 daemon prio=5 os_prio=0 tid=0x00007f47d82a8800 nid=0x74a waiting on condition [0x00007f346f179000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e81695698> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
...
"elasticsearch[scheduler][T#1]" #81693 daemon prio=5 os_prio=0 tid=0x00007f3b313de000 nid=0x3115 waiting on condition [0x00007f3469db0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3eec28a730> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
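(To see which pools the leftover threads belong to, the full dump can be aggregated by pool name. A quick sketch — the regex is my reading of the thread-header format shown above, not anything official:)

```python
import re
from collections import Counter

def pool_counts(dump_text):
    """Count threads per Elasticsearch pool name in a jcmd Thread.print dump."""
    # Thread headers look like: "elasticsearch[refresh][T#1]" #89175 daemon ...
    # or, for the timer thread:  "elasticsearch[[timer]]" #89201 daemon ...
    names = re.findall(r'^"elasticsearch\[+([^\]]+)\]', dump_text, re.MULTILINE)
    return Counter(names)

# Tiny sample built from the excerpt above; in practice, feed in the whole dump.
sample = '''"elasticsearch[[timer]]" #89201 daemon prio=5
"elasticsearch[scheduler][T#1]" #89200 daemon prio=5
"elasticsearch[refresh][T#1]" #89175 daemon prio=5
"elasticsearch[refresh][T#1]" #89174 daemon prio=5
'''
print(pool_counts(sample).most_common())
```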
Some other notes:
- On restarting the Elasticsearch process, the old threads are cleared out and search latency drops back down (this mitigates my issue, but ideally Elasticsearch would close old threads on its own).
- While googling, I saw that some Solr and Lucene users had problems with threads staying open when the Lucene IndexWriter was not .close()'d.
- When looking through the Elasticsearch code, I could not find where the IndexEngine or IndexShard closes the Lucene IndexWriter (which should in turn shut down the associated thread pools).