Monitor tasks piling up on the master node (6.8.3)

We have a cluster of three dedicated master, 19 data, and two coordinating nodes running version 6.8.3. We've been having infrequent but ongoing issues where cluster:monitor/stats tasks start piling up on the active master node, sometimes eventually recovering and sometimes only clearing after a full cluster restart. Basic indexing and search seem unaffected, but Kibana goes red because its health checks time out, which makes troubleshooting that much harder. Last night it got so bad (thousands of overdue tasks) that the coordinating nodes lost their connections (we really don't understand why it was just those).
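
For what it's worth, this is roughly how we watch the backlog once it starts. It's a minimal Python sketch, not anything official: localhost:9200 is a placeholder for one of our coordinating nodes, and it assumes the tasks API is still responding.

# Sketch: list running cluster:monitor tasks and how long each has been running.
# localhost:9200 is a placeholder; point it at a node that still answers HTTP.
import json
import urllib.request

ES = "http://localhost:9200"

with urllib.request.urlopen(ES + "/_tasks?actions=cluster:monitor*&detailed") as resp:
    data = json.load(resp)

for node in data["nodes"].values():
    for task in node["tasks"].values():
        age_s = task["running_time_in_nanos"] / 1e9
        print(f'{node["name"]:<12} {task["action"]:<45} {age_s:>8.1f}s')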

I can't find anyone else running into this, so it must be something we're doing. Any suggestions to help get to the bottom of this?

Here are some typical log messages:

[2019-12-19T20:03:22,757][WARN ][o.e.t.TransportService   ] [esmaster02] Received response for a request that has timed out, sent [1723779ms] ago, timed out [1708770ms] ago, action [cluster:monitor/nodes/stats[n]], node [{esworker56}{JLkl7mb6QdmXhBFt5LYdeQ}{-LX62WdJTfmMJAkXLvjrgw}{}{ml.machine_memory=67378266112, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [308828]
[2019-12-19T20:03:40,850][WARN ][o.e.t.TransportService   ] [esmaster02] Transport response handler not found of id [308823]
[2019-12-19T20:08:08,974][WARN ][o.e.c.InternalClusterInfoService] [esmaster02] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout

Here's what jstack says the management threads are busy doing on one of the worker nodes:

"elasticsearch[esworker49][management][T#1]" #172 daemon prio=5 os_prio=0 cpu=18499108.54ms elapsed=19817.97s tid=0x00007f70d4005800 nid=0x175 runnable  [0x00007f6f2c3df000]
   java.lang.Thread.State: RUNNABLE
	at java.util.TreeMap.getEntry(java.base@12.0.2/TreeMap.java:357)
	at java.util.TreeMap.get(java.base@12.0.2/TreeMap.java:277)
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.terms(BlockTreeTermsReader.java:265)
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:307)
	at org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
	at org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:352)
	at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:217)
	at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1073)
	at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
--
"elasticsearch[esworker49][management][T#2]" #173 daemon prio=5 os_prio=0 cpu=18507085.61ms elapsed=19817.97s tid=0x00007f70d4007800 nid=0x176 runnable  [0x00007f6f2c2de000]
   java.lang.Thread.State: RUNNABLE
	at java.util.TreeMap.getEntry(java.base@12.0.2/TreeMap.java:357)
	at java.util.TreeMap.get(java.base@12.0.2/TreeMap.java:277)
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.terms(BlockTreeTermsReader.java:265)
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:307)
	at org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
	at org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:352)
	at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:217)
	at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1073)
	at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
--
"elasticsearch[esworker49][management][T#3]" #174 daemon prio=5 os_prio=0 cpu=18500880.88ms elapsed=19817.97s tid=0x00007f70d4009800 nid=0x177 runnable  [0x00007f6f2c1de000]
   java.lang.Thread.State: RUNNABLE
	at java.io.UnixFileSystem.canonicalize0(java.base@12.0.2/Native Method)
	at java.io.UnixFileSystem.canonicalize(java.base@12.0.2/UnixFileSystem.java:158)
	at java.io.File.getCanonicalPath(java.base@12.0.2/File.java:618)
	at java.io.FilePermission$2.run(java.base@12.0.2/FilePermission.java:388)
	at java.io.FilePermission$2.run(java.base@12.0.2/FilePermission.java:376)
	at java.security.AccessController.executePrivileged(java.base@12.0.2/AccessController.java:751)
	at java.security.AccessController.doPrivileged(java.base@12.0.2/AccessController.java:310)
	at java.io.FilePermission.init(java.base@12.0.2/FilePermission.java:376)
	at java.io.FilePermission.<init>(java.base@12.0.2/FilePermission.java:475)
--
"elasticsearch[esworker49][management][T#4]" #175 daemon prio=5 os_prio=0 cpu=18492156.07ms elapsed=19817.97s tid=0x00007f70d400b800 nid=0x178 runnable  [0x00007f6ee0ffa000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.ramBytesUsed(BlockTreeTermsReader.java:294)
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.ramBytesUsed(PerFieldPostingsFormat.java:326)
	at org.apache.lucene.index.CodecReader.ramBytesUsed(CodecReader.java:214)
	at org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:849)
	at org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:831)
	at org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1053)
	at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:213)
	at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:125)
	at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)
--
"elasticsearch[esworker49][management][T#5]" #176 daemon prio=5 os_prio=0 cpu=18476090.45ms elapsed=19817.97s tid=0x00007f70d400d800 nid=0x179 runnable  [0x00007f6ee0ef9000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.ramBytesUsed(BlockTreeTermsReader.java:294)
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.ramBytesUsed(PerFieldPostingsFormat.java:326)
	at org.apache.lucene.index.CodecReader.ramBytesUsed(CodecReader.java:214)
	at org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:849)
	at org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:831)
	at org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1053)
	at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:213)
	at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:125)
	at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)
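
If I remember right, the management pool tops out at five threads per node, which would line up with the five pegged threads in the jstack above. This is a quick sketch of the check we use to see whether the management pools are saturated across the data nodes; again, the endpoint is a placeholder.

# Sketch: dump active/queue/rejected counts for the management thread pool on every node.
import urllib.request

ES = "http://localhost:9200"  # placeholder; any node that still answers HTTP
cols = "node_name,name,active,queue,rejected,completed"
url = ES + "/_cat/thread_pool/management?v&h=" + cols
print(urllib.request.urlopen(url).read().decode())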

Follow-up: after a night of nodes crashing with OOM from this (?!), we completely shut the cluster down and rebooted all machines (just restarting the ES processes did not fix it). The only guess we have is that something went wrong with the NFS mounts for the snapshot volume, since there was NAS maintenance around the time the issues started. The mounts weren't stale, though; I had previously checked that I could stat a file on them. But maybe the cluster monitoring was hitting something while checking disk usage?
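
If it helps anyone else, this is the kind of check we plan to run next time to rule the mount in or out. It's only a sketch, /mnt/es-snapshots is a placeholder for our snapshot repository path, and the idea that a plain stat on a file can succeed while a disk-usage call on the same NFS mount hangs is just our guess, not something we've confirmed.

# Sketch: run a disk-usage call against the snapshot mount in a child process
# and treat anything slower than 15s as a wedged mount.
import multiprocessing
import os

SNAPSHOT_MOUNT = "/mnt/es-snapshots"  # placeholder path

def check_usage(path):
    st = os.statvfs(path)  # roughly what a disk-usage probe does
    print(path, "free GiB:", round(st.f_bavail * st.f_frsize / 2**30, 1))

if __name__ == "__main__":
    p = multiprocessing.Process(target=check_usage, args=(SNAPSHOT_MOUNT,))
    p.start()
    p.join(timeout=15)
    if p.is_alive():
        p.terminate()
        print("disk-usage call hung for >15s -- mount is probably wedged")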

It seems like a bug that ES would let the tasks queue up for so long while losing track of the requests themselves (eventually the logs filled up with messages like "Transport response handler not found of id [314053]"). And it's very concerning that a backlog of monitoring tasks could affect the cluster to the point that nodes start crashing.
