Cluster sitting idle using 80% CPU, System Load over 5, ES 7.4.1

After ingesting some test data into our cluster, we are seeing the data nodes run at over 80% CPU with a system load of between 2 and 5 on each data node while the cluster is just sitting idle: nothing ingesting, nothing querying. We turned off X-Pack monitoring, which seems to have helped a little, but not a lot.
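For the record, "turning off monitoring" here just meant disabling collection through the cluster settings API, roughly like this (assuming xpack.monitoring.collection.enabled is the relevant setting):

# curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{"persistent": {"xpack.monitoring.collection.enabled": false}}'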

_cat/nodes can take up to a minute to respond.

# curl -XGET localhost:9200/_cat/health?v
epoch      timestamp cluster              status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1580322080 18:21:20  storage_cluster      green          31        24   2667 1335    0    0        0             0                  -                100.0%
 # curl -XGET localhost:9200/_cat/nodes?v
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.72.71             32          67  63    5.13    5.10     5.06 d         -      esdata-0d57
192.168.109.6             37          64  60    5.12    5.14     5.03 d         -      esdata-021d
192.168.137.62            48          90   1    0.15    0.16     0.15 m         *      esmaster-013f
192.168.156.214           35          62  63    5.04    5.14     5.09 d         -      esdata-0628
192.168.102.100           32          64  53    4.96    4.90     4.47 d         -      esdata-0883
192.168.100.245           46          67  63    5.25    5.28     5.15 d         -      esdata-0afb
192.168.79.122            35          67  62    4.81    3.58     3.72 d         -      esdata-06a3
192.168.72.36             43          69  65    5.03    5.10     5.06 d         -      esdata-0728
192.168.78.109            35          68  62    5.09    5.13     4.90 d         -      esdata-0ba5
192.168.81.244            27          57   0    0.18    0.13     0.10 i         -      esingest-04b7
192.168.105.14            28          57   0    0.04    0.08     0.07 i         -      esingest-06a1
192.168.111.67            60          88   1    0.24    0.20     0.08 m         -      esmaster-032c
192.168.126.55            36          65  63    5.05    5.21     5.10 d         -      esdata-0853
192.168.66.57             33          66  61    5.09    5.21     5.02 d         -      esdata-0b23
192.168.113.31            41          68  63    5.14    4.66     4.18 d         -      esdata-063a
192.168.158.54            55          63  63    5.09    5.23     4.68 d         -      esdata-0c13
192.168.142.225           31          62  66    5.12    5.23     5.13 d         -      esdata-0ff9
192.168.126.225           35          68  59    5.14    5.13     4.89 d         -      esdata-0d78
192.168.76.146            36          68  63    4.82    5.10     5.09 d         -      esdata-0f5a
192.168.138.103           26          57   0    0.05    0.11     0.06 i         -      esingest-0e3b
192.168.132.53            32          61  63    5.12    5.15     5.05 d         -      esdata-0a2e
192.168.83.129            31          72   0    0.15    0.19     0.18 -         -      console-0de8
192.168.144.33            36          66  62    5.52    5.26     5.08 d         -      esdata-0170
192.168.83.121            48          70  63    4.72    4.48     4.58 d         -      esdata-0b2a
192.168.152.165           34          61  80    5.38    5.25     5.08 d         -      esdata-0264
192.168.88.178            38          65  66    5.04    5.12     5.04 d         -      esdata-0a44
192.168.108.201           35          64  63    3.59    4.41     4.76 d         -      esdata-0fdb
192.168.133.171           36          63  63    5.10    5.22     4.97 d         -      esdata-05e3
192.168.65.214            65          88   0    0.02    0.06     0.06 m         -      esmaster-0c99
192.168.122.195           36          64  63    5.13    5.13     5.01 d         -      esdata-05ed
192.168.147.93            35          61  53    4.70    4.95     4.98 d         -      esdata-037b

Hot threads pretty much all look like this:

99.1% (495.6ms out of 500ms) cpu usage by thread 'elasticsearch[esdata-0ba5][management][T#1]'
     4/10 snapshots sharing following 20 elements
       java.base@13/java.util.TreeMap.getEntry(TreeMap.java:357)
       java.base@13/java.util.TreeMap.get(TreeMap.java:277)
       app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:311)
       app//org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
       app//org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:366)
       app//org.elasticsearch.index.engine.Engine.completionStats(Engine.java:197)
       app//org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1048)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:207)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13/java.lang.Thread.run(Thread.java:830)
     4/10 snapshots sharing following 21 elements
       java.base@13/java.util.TreeMap.getEntry(TreeMap.java:357)
       java.base@13/java.util.TreeMap.get(TreeMap.java:277)
       app//org.apache.lucene.codecs.blocktree.BlockTreeTermsReader.terms(BlockTreeTermsReader.java:304)
       app//org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:312)
       app//org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
       app//org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:366)
       app//org.elasticsearch.index.engine.Engine.completionStats(Engine.java:197)
       app//org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1048)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:207)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13/java.lang.Thread.run(Thread.java:830)
     2/10 snapshots sharing following 17 elements
       app//org.apache.lucene.index.CodecReader.terms(CodecReader.java:106)
       app//org.apache.lucene.index.FilterLeafReader.terms(FilterLeafReader.java:366)
       app//org.elasticsearch.index.engine.Engine.completionStats(Engine.java:197)
       app//org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:1048)
       app//org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:207)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:120)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:52)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:129)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240)
       app//org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
       app//org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@13/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       java.base@13/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       java.base@13/java.lang.Thread.run(Thread.java:830)
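
(For reference, the dump above is from the standard hot threads API, something along these lines, with the interval and thread count left at their defaults:)

# curl -XGET localhost:9200/_nodes/hot_threads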

After rebooting all nodes, they slowly climb from 40% CPU until they reach this state again. Attempting to query the cluster in this state just pushes it over the edge and requires another reboot.

Any ideas where to look?

I would suggest reducing the number of shards by 10x-100x, if you can.
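
A rough sketch of one way to do that with the shrink API, if the indices are no longer being written to: first make the index read-only and co-locate a copy of every shard on one node, then shrink it to far fewer primaries (the new count has to be a factor of the old one). The index and node names below are just placeholders.

# curl -XPUT localhost:9200/my-oversharded-index/_settings -H 'Content-Type: application/json' -d '{"settings": {"index.routing.allocation.require._name": "esdata-0d57", "index.blocks.write": true}}'
# curl -XPOST localhost:9200/my-oversharded-index/_shrink/my-shrunk-index -H 'Content-Type: application/json' -d '{"settings": {"index.number_of_shards": 1, "index.routing.allocation.require._name": null, "index.blocks.write": null}}'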

Turns out it was one bad index. Deleted it and immediately everything settled across all nodes.
At 20 shards per GB of heap * 32 GB heap * 24 data nodes we should be able to support 15,360 shards, so our 2,667 shards are well below the suggested limits. We had a lot of data with forced document IDs that was indexed twice, which we believe may have caused the bad index.
The shards were definitely too small, winding up well below the recommended 20-50 GB range, but we had only just begun indexing when this happened.
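(Shard sizes were eyeballed with something like the following, sorted largest first:)

# curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store&s=store:desc'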

Our guess is that the duplicated document IDs caused some issue in ES that forced it into this bad behaviour. The big oddity was that once we deleted the 17-day-old index, ES suddenly reported it had indexed 9,000 documents into that index, even though everything had been turned off for days and the cluster had no way to receive new documents. It's as if it was holding on to documents that still needed to be indexed and would rather blow itself up than let them go.

So, this means it is not currently reproducible? Did that one index have a completion suggest field?

Actually, it's back. We're finding that if we ingest more than, say, 50k documents/second we don't get any 429s; the cluster just falls over and goes into this "bad state". Not sure whether this is another bad index or not, but at 35k documents/second we saw about 40% CPU across all data nodes. Increase the ingestion rate and it falls over. If we then stop ingestion, sometimes it recovers in about 20 minutes; other times it just stays at 80% CPU across the data nodes until we restart it (even overnight).
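
To double-check the lack of 429s, we've been watching write-pool rejections with something along these lines:

# curl -XGET 'localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'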

We're running 5 clusters similar to this one and it looks like we can get any of them into this state. We're going to try G1GC, but the lack of 429s back to our writers is odd.
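
The G1GC switch is just the usual jvm.options edit, roughly the following (the exact lines depend on the jvm.options that ships with your version):

# comment out the CMS flags
#-XX:+UseConcMarkSweepGC
#-XX:CMSInitiatingOccupancyFraction=75
#-XX:+UseCMSInitiatingOccupancyOnly
# and enable G1 instead
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30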

I'll go look at that completion suggest field. Do you have to add that explicitly in the mappings?
EDIT: No suggesters in the mappings at all.
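(A completion suggester only exists if a field is explicitly mapped with "type": "completion", so we just grepped the mappings for that, roughly like this, and it came back empty:)

# curl -XGET localhost:9200/_mapping | grep -o '"type":"completion"'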

So we found that we do not need to restart the whole cluster; restarting just the nodes with a massive management queue restores things temporarily.
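The queues below are from the thread pool cat API, something like:

# curl -XGET 'localhost:9200/_cat/thread_pool/management?v'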

node_name        name       active queue rejected
esdata-i-021d    management      5     8        0
esdata-i-0d57    management      5     5        0
esdata-i-0170    management      5     5        0
esmaster-i-013f  management      1     0        0
esdata-i-0a2e    management      5     7        0
esdata-i-0c13    management      5   202        0
esdata-i-0a44    management      5     5        0
esquery-i-078a   management      1     0        0
esquery-i-09d2   management      1     0        0
esdata-i-0264    management      5     5        0
esdata-i-0f5a    management      5  5914        0
esdata-i-0fdb    management      5     8        0
esdata-i-0b23    management      5     7        0
esingest-i-0e3b  management      1     0        0
console-i-0de8   management      1     0        0
<snipped>

We found that restarting ES on one node would sometimes clear queues on another node as well. We're wondering whether tracking which indices live on these nodes might provide a hint.
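
Probably something like this against the node with the 5914-deep queue above (esdata-i-0f5a) is the place to start:

# curl -XGET 'localhost:9200/_cat/shards?v' | grep 0f5a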

But with nothing ingesting, it seems weird that the management thread pools won't clear unless they are blocked on something.

G1GC helped a bit, but clusters are still suddenly shooting to 80% CPU across all data nodes and falling over. Still not a 429 in sight. No evidence in the INFO logs of anything going wrong before it happens. We will put one cluster in trace mode to see if there's a clue.
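
The plan is to raise the log level dynamically rather than editing config files, roughly like this (transient, so it clears on restart; the logger will probably need narrowing, since the root logger at trace is extremely noisy):

# curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{"transient": {"logger.org.elasticsearch": "trace"}}'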

We're now seeing clusters that have been indexing leisurely at 15k documents/second for days suddenly fall over.
