After ingesting some test data into our cluster, the data nodes are running at over 80% CPU with a system load of 2-5 each, even though the cluster is sitting idle: nothing is ingesting and nothing is querying. Turning off x-pack monitoring seemed to help a little, but not much.
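For reference, monitoring was switched off via the standard dynamic cluster setting, roughly with the call below (the exact JSON we ran may have differed slightly):

# curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "xpack.monitoring.collection.enabled": false
  }
}'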
Even a simple _cat/nodes call can take up to a minute to return, although cluster health reports green:

# curl -XGET localhost:9200/_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1580322080 18:21:20 storage_cluster green 31 24 2667 1335 0 0 0 0 - 100.0%
# curl -XGET localhost:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.72.71 32 67 63 5.13 5.10 5.06 d - esdata-0d57
192.168.109.6 37 64 60 5.12 5.14 5.03 d - esdata-021d
192.168.137.62 48 90 1 0.15 0.16 0.15 m * esmaster-013f
192.168.156.214 35 62 63 5.04 5.14 5.09 d - esdata-0628
192.168.102.100 32 64 53 4.96 4.90 4.47 d - esdata-0883
192.168.100.245 46 67 63 5.25 5.28 5.15 d - esdata-0afb
192.168.79.122 35 67 62 4.81 3.58 3.72 d - esdata-06a3
192.168.72.36 43 69 65 5.03 5.10 5.06 d - esdata-0728
192.168.78.109 35 68 62 5.09 5.13 4.90 d - esdata-0ba5
192.168.81.244 27 57 0 0.18 0.13 0.10 i - esingest-04b7
192.168.105.14 28 57 0 0.04 0.08 0.07 i - esingest-06a1
192.168.111.67 60 88 1 0.24 0.20 0.08 m - esmaster-032c
192.168.126.55 36 65 63 5.05 5.21 5.10 d - esdata-0853
192.168.66.57 33 66 61 5.09 5.21 5.02 d - esdata-0b23
192.168.113.31 41 68 63 5.14 4.66 4.18 d - esdata-063a
192.168.158.54 55 63 63 5.09 5.23 4.68 d - esdata-0c13
192.168.142.225 31 62 66 5.12 5.23 5.13 d - esdata-0ff9
192.168.126.225 35 68 59 5.14 5.13 4.89 d - esdata-0d78
192.168.76.146 36 68 63 4.82 5.10 5.09 d - esdata-0f5a
192.168.138.103 26 57 0 0.05 0.11 0.06 i - esingest-0e3b
192.168.132.53 32 61 63 5.12 5.15 5.05 d - esdata-0a2e
192.168.83.129 31 72 0 0.15 0.19 0.18 - - console-0de8
192.168.144.33 36 66 62 5.52 5.26 5.08 d - esdata-0170
192.168.83.121 48 70 63 4.72 4.48 4.58 d - esdata-0b2a
192.168.152.165 34 61 80 5.38 5.25 5.08 d - esdata-0264
192.168.88.178 38 65 66 5.04 5.12 5.04 d - esdata-0a44
192.168.108.201 35 64 63 3.59 4.41 4.76 d - esdata-0fdb
192.168.133.171 36 63 63 5.10 5.22 4.97 d - esdata-05e3
192.168.65.214 65 88 0 0.02 0.06 0.06 m - esmaster-0c99
192.168.122.195 36 64 63 5.13 5.13 5.01 d - esdata-05ed
192.168.147.93 35 61 53 4.70 4.95 4.98 d - esdata-037b
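To see what the data nodes are actually busy with, hot threads were pulled with the stock hot threads API (default 500ms sampling interval), roughly:

# curl -XGET localhost:9200/_nodes/hot_threads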
Hot threads pretty much all look like this:
99.1% (495.6ms out of 500ms) cpu usage by thread 'elasticsearch[esdata-0ba5][management][T#1]'
4/10 snapshots sharing following 20 elements
java.base@13/java.util.TreeMap.getEntry(TreeMap.java:357)
java.base@13/java.util.TreeMap.get(TreeMap.java:277)
app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:311)
app//org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
app//org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:366)
app//org.elasticsearch.index.engine.Engine.completionStats(Engine.java:197)
app//org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1048)
app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:207)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240)
app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@13/java.lang.Thread.run(Thread.java:830)
4/10 snapshots sharing following 21 elements
java.base@13/java.util.TreeMap.getEntry(TreeMap.java:357)
java.base@13/java.util.TreeMap.get(TreeMap.java:277)
app//org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.terms(BlockTreeTermsReader.java:304)
app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:312)
app//org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
app//org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:366)
app//org.elasticsearch.index.engine.Engine.completionStats(Engine.java:197)
app//org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1048)
app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:207)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240)
app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@13/java.lang.Thread.run(Thread.java:830)
2/10 snapshots sharing following 17 elements
app//org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
app//org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:366)
app//org.elasticsearch.index.engine.Engine.completionStats(Engine.java:197)
app//org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1048)
app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:207)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240)
app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.base@13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base@13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base@13/java.lang.Thread.run(Thread.java:830)
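Every stack goes through Engine.completionStats, called from TransportClusterStatsAction (the handler behind the cluster stats API), so the management threads appear to be busy computing completion-suggester stats for cluster stats requests. For reference, the management thread pool and that endpoint can be poked directly with the standard APIs (the column list below is just the set that seemed relevant):

# curl -XGET 'localhost:9200/_cat/thread_pool/management?v&h=node_name,active,queue,rejected,completed'
# curl -XGET 'localhost:9200/_cluster/stats?pretty'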
After rebooting all of the nodes, CPU slowly climbs from around 40% until they are back in this state. Attempting to query the cluster at that point just pushes it over the edge and requires another reboot.
Any ideas where to look?