Hi Elasticsearch community, does Elasticsearch cleanly close the per-index threads after deleting an index?
ES Version: 7.6.1
I have encountered an issue with my Elasticsearch setup where the thread count slowly grows over time; while it grows, I notice end-to-end search latency increasing.
Below is a graph showing the operating system's count of total threads. The "steps up" in the step function occur when I create a new index. Every day, I create 5 new indices with a total of 504 shards (~165 primaries, 339 replicas) across 39 nodes plus 3 masters, which works out to 12-13 new shards created per node each day. I then delete 5 old indices.
Looking at the threads that stay open each day, I see a total of ~255 new threads per day (per new index created) that are not closed afterwards.
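(For reference, the thread counts in the graph can be sampled straight from the kernel; this is a minimal sketch, Linux-only, where the Elasticsearch PID would be passed in — "self" below is just a placeholder for demonstration.)

```python
def os_thread_count(pid="self"):
    """Read a process's OS-level thread count from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("Threads: line not found in /proc status")

# In my case I'd point this at the Elasticsearch PID, e.g. os_thread_count("12345"),
# and sample it on a schedule to correlate the steps with index creation/deletion.
print(os_thread_count())
```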
When I run:
jdk1.8/bin/jcmd [ES_process_id] Thread.print
Here is an excerpt of some of the threads that are left hanging; in total there are over 13,000 threads.
"elasticsearch[[timer]]" #89201 daemon prio=5 os_prio=0 tid=0x00007f3b3146d800 nid=0xaa4 waiting on condition [0x00007f3466463000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.elasticsearch.threadpool.ThreadPool$CachedTimeThread.run(ThreadPool.java:592)
"elasticsearch[scheduler][T#1]" #89200 daemon prio=5 os_prio=0 tid=0x00007f3b3146c000 nid=0xaa3 waiting on condition [0x00007f346cb61000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e5d063388> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
"elasticsearch[[timer]]" #89199 daemon prio=5 os_prio=0 tid=0x00007f3b3146a800 nid=0xaa2 waiting on condition [0x00007f348580d000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.elasticsearch.threadpool.ThreadPool$CachedTimeThread.run(ThreadPool.java:592)
"elasticsearch[scheduler][T#1]" #89198 daemon prio=5 os_prio=0 tid=0x00007f3b31468800 nid=0xaa1 waiting on condition [0x00007f3584ca0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3ed2cc74d0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
...
"elasticsearch[refresh][T#1]" #89175 daemon prio=5 os_prio=0 tid=0x00007f47e0479000 nid=0x74b waiting on condition [0x00007f346f078000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3f43436c50> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
"elasticsearch[refresh][T#1]" #89174 daemon prio=5 os_prio=0 tid=0x00007f47d82a8800 nid=0x74a waiting on condition [0x00007f346f179000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3e81695698> (a org.elasticsearch.common.util.concurrent.EsExecutors$ExecutorScalingQueue)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
at java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
at java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
...
"elasticsearch[scheduler][T#1]" #81693 daemon prio=5 os_prio=0 tid=0x00007f3b313de000 nid=0x3115 waiting on condition [0x00007f3469db0000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007f3eec28a730> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:749)
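(To see which pools the leftover threads belong to, the full dump can be aggregated by pool name. A quick sketch — the regex is my reading of the thread-header format shown above, not anything official:)

```python
import re
from collections import Counter

def pool_counts(dump_text):
    """Count threads per Elasticsearch pool name in a jcmd Thread.print dump."""
    # Thread headers look like: "elasticsearch[refresh][T#1]" #89175 daemon ...
    # or, for the timer thread:  "elasticsearch[[timer]]" #89201 daemon ...
    names = re.findall(r'^"elasticsearch\[+([^\]]+)\]', dump_text, re.MULTILINE)
    return Counter(names)

# Tiny sample built from the excerpt above; in practice, feed in the whole dump.
sample = '''"elasticsearch[[timer]]" #89201 daemon prio=5
"elasticsearch[scheduler][T#1]" #89200 daemon prio=5
"elasticsearch[refresh][T#1]" #89175 daemon prio=5
"elasticsearch[refresh][T#1]" #89174 daemon prio=5
'''
print(pool_counts(sample).most_common())
```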
Some other notes:
- On restarting the Elasticsearch process, the old threads are cleared out and search latency drops back down (this mitigates my issue, but ideally Elasticsearch would close old threads on its own).
- While googling, I saw that some Solr and Lucene users had problems with threads staying open when the Lucene IndexWriter was not .close()'d.
- When looking through the Elasticsearch code, I could not find where the IndexEngine or IndexShard closes the Lucene IndexWriter (which should in turn shut down the associated thread pools).