We have a cluster of three dedicated masters, 19 data, and two coordinating nodes running version 6.8.3. We've been having infrequent, but ongoing, issues where cluster:monitor/stats
tasks start piling up on the active master node, sometimes eventually recovering, sometimes only being cleared by a full cluster restart. The basic index and search functionality of the cluster seems unaffected, but Kibana goes red since its health checks time out, which makes trouble shooting that much harder. Last night it got so bad (thousands of overdue tasks) that the coordinating nodes lost connection (really don't understand why just those).
I can't find anyone else running into this, so it must be something we're doing. Any suggestions to help get to the bottom of this?
Here's some typical log messages:
[2019-12-19T20:03:22,757][WARN ][o.e.t.TransportService ] [esmaster02] Received response for a request that has timed out, sent [1723779ms] ago, timed out [1708770ms] ago, action [cluster:monitor/nodes/stats[n]], node [{esworker56}{JLkl7mb6QdmXhBFt5LYdeQ}{-LX62WdJTfmMJAkXLvjrgw}{}{ml.machine_memory=67378266112, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [308828]
[2019-12-19T20:03:40,850][WARN ][o.e.t.TransportService ] [esmaster02] Transport response handler not found of id [308823]
[2019-12-19T20:08:08,974][WARN ][o.e.c.InternalClusterInfoService] [esmaster02] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
Here's what jstack says the management threads are busy doing on one of the worker nodes:
elasticsearch[esworker49][management][T#1]" #172 daemon prio=5 os_prio=0 cpu=18499108.54ms elapsed=19817.97s tid=0x00007f70d4005800 nid=0x175 runnable [0x00007f6f2c3df000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.getEntry(java.base@12.0.2/TreeMap.java:357)
at java.util.TreeMap.get(java.base@12.0.2/TreeMap.java:277)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.terms(BlockTreeTermsReader.java:265)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:307)
at org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
at org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:352)
at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:217)
at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1073)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
--
"elasticsearch[esworker49][management][T#2]" #173 daemon prio=5 os_prio=0 cpu=18507085.61ms elapsed=19817.97s tid=0x00007f70d4007800 nid=0x176 runnable [0x00007f6f2c2de000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.getEntry(java.base@12.0.2/TreeMap.java:357)
at java.util.TreeMap.get(java.base@12.0.2/TreeMap.java:277)
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.terms(BlockTreeTermsReader.java:265)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:307)
at org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
at org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:352)
at org.elasticsearch.index.engine.Engine.completionStats(Engine.java:217)
at org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1073)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
--
"elasticsearch[esworker49][management][T#3]" #174 daemon prio=5 os_prio=0 cpu=18500880.88ms elapsed=19817.97s tid=0x00007f70d4009800 nid=0x177 runnable [0x00007f6f2c1de000]
java.lang.Thread.State: RUNNABLE
at java.io.UnixFileSystem.canonicalize0(java.base@12.0.2/Native Method)
at java.io.UnixFileSystem.canonicalize(java.base@12.0.2/UnixFileSystem.java:158)
at java.io.File.getCanonicalPath(java.base@12.0.2/File.java:618)
at java.io.FilePermission$2.run(java.base@12.0.2/FilePermission.java:388)
at java.io.FilePermission$2.run(java.base@12.0.2/FilePermission.java:376)
at java.security.AccessController.executePrivileged(java.base@12.0.2/AccessController.java:751)
at java.security.AccessController.doPrivileged(java.base@12.0.2/AccessController.java:310)
at java.io.FilePermission.init(java.base@12.0.2/FilePermission.java:376)
at java.io.FilePermission.<init>(java.base@12.0.2/FilePermission.java:475)
--
"elasticsearch[esworker49][management][T#4]" #175 daemon prio=5 os_prio=0 cpu=18492156.07ms elapsed=19817.97s tid=0x00007f70d400b800 nid=0x178 runnable [0x00007f6ee0ffa000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.ramBytesUsed(BlockTreeTermsReader.java:294)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.ramBytesUsed(PerFieldPostingsFormat.java:326)
at org.apache.lucene.index.CodecReader.ramBytesUsed(CodecReader.java:214)
at org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:849)
at org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:831)
at org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1053)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:213)
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:125)
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)
--
"elasticsearch[esworker49][management][T#5]" #176 daemon prio=5 os_prio=0 cpu=18476090.45ms elapsed=19817.97s tid=0x00007f70d400d800 nid=0x179 runnable [0x00007f6ee0ef9000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.ramBytesUsed(BlockTreeTermsReader.java:294)
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.ramBytesUsed(PerFieldPostingsFormat.java:326)
at org.apache.lucene.index.CodecReader.ramBytesUsed(CodecReader.java:214)
at org.elasticsearch.index.engine.Engine.fillSegmentStats(Engine.java:849)
at org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:831)
at org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:1053)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:213)
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:125)
at org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:54)