Warm nodes consistently use ~10% CPU even though they are basically idle

We have 6 warm nodes holding 53 indices, each index with 10 shards and 10 segments. Each node is a quad-core machine with 26GB of RAM (13GB heap). Even in the middle of the night, CPU usage isn't low.


We are using HDDs for the warm nodes and SSDs for the hot nodes.
ES version 5.6.8
OpenJDK Runtime Environment (build 1.8.0_161-b14)

Hot threads on the warm nodes during working hours:

   32.9% (164.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm005][management][T#5]'
   33.0% (165ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][management][T#5]'
   33.4% (166.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][management][T#3]'
   33.6% (167.7ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][management][T#5]'
   33.9% (169.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][management][T#3]'
   34.2% (171ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][management][T#5]'
   35.1% (175.3ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][search][T#6]'
   35.1% (175.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][search][T#1]'
   35.2% (176.2ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][management][T#4]'
   36.3% (181.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm005][search][T#5]'
   37.0% (185.1ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][management][T#3]'
   38.2% (190.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm005][search][T#6]'
   38.3% (191.5ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm005][management][T#4]'
   39.4% (196.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][search][T#2]'
   39.7% (198.3ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm006][management][T#3]'
   40.6% (202.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][management][T#2]'
   41.5% (207.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm005][search][T#2]'
   46.5% (232.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][management][T#4]'

Hot threads while an old index was being moved from the hot nodes to the warm nodes:

   23.7% (118.2ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][[transport_server_worker.default]][T#7]'
   15.9% (79.7ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][[transport_server_worker.default]][T#6]'
   14.3% (71.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][[transport_server_worker.default]][T#8]'
   12.9% (64.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][[transport_server_worker.default]][T#4]'
    3.7% (18.5ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][generic][T#8]'
   18.2% (91ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][[transport_server_worker.default]][T#1]'
   18.1% (90.7ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][[transport_server_worker.default]][T#4]'
   16.1% (80.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm005][[transport_server_worker.default]][T#4]'
    2.1% (10.2ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][management][T#4]'
    6.1% (30.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm006][management][T#2]'
    3.8% (18.8ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm006][generic][T#33]'
    3.6% (18.1ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][generic][T#28]'
    2.2% (11ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][management][T#4]'
   28.1% (140.5ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm006][[transport_server_worker.default]][T#2]'
   21.1% (105.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][management][T#4]'
   19.0% (94.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][management][T#5]'
   18.0% (89.8ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][generic][T#15]'
   18.5% (92.4ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm006][[transport_server_worker.default]][T#2]'
   21.1% (105.5ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][management][T#1]'
   19.3% (96.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][management][T#4]'
   19.2% (96.2ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][management][T#3]'
   20.4% (102.2ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][generic][T#15]'
    8.5% (42.3ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm005][[transport_server_worker.default]][T#8]'
   23.0% (115.1ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][management][T#2]'

Hot threads while the forcemerge procedure is running:

   100.3% (501.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][[logstash-2018.04.28][4]: Lucene Merge Thread #0]'
   100.0% (500.2ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][[logstash-2018.04.28][1]: Lucene Merge Thread #0]'
   83.9% (419.3ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][[logstash-2018.04.28][2]: Lucene Merge Thread #0]'
   100.2% (501.1ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][[logstash-2018.04.28][1]: Lucene Merge Thread #0]'
   23.1% (115.5ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][management][T#4]'
   17.8% (88.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][management][T#5]'
   100.4% (501.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][[logstash-2018.04.28][4]: Lucene Merge Thread #0]'
   100.3% (501.5ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm001][[logstash-2018.04.28][2]: Lucene Merge Thread #0]'
   24.1% (120.2ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][management][T#4]'
   22.3% (111.5ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][management][T#3]'
   17.5% (87.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][management][T#1]'
   100.1% (500.7ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm006][[logstash-2018.04.28][3]: Lucene Merge Thread #0]'
   100.2% (501.1ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm003][[logstash-2018.04.28][1]: Lucene Merge Thread #0]'
   100.4% (501.9ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][[logstash-2018.04.28][4]: Lucene Merge Thread #0]'
   100.2% (501ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm002][[logstash-2018.04.28][1]: Lucene Merge Thread #0]'
   100.3% (501.6ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm006][[logstash-2018.04.28][3]: Lucene Merge Thread #0]'
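
For reference, all of the hot-thread dumps above come from the nodes hot threads API, roughly like this (a sketch, assuming the cluster is reachable on localhost:9200 without any security in front of it):

    import urllib.request

    def hot_threads(host="http://localhost:9200", threads=5, interval="500ms"):
        # GET /_nodes/hot_threads returns a plain-text dump like the ones above.
        url = f"{host}/_nodes/hot_threads?threads={threads}&interval={interval}"
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8")

    print(hot_threads())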

Why doesn't CPU usage drop below 10% when the nodes are doing nothing - no searches, forcemerges, or restore operations?

What does this thread mean:

   33.6% (167.7ms out of 500ms) cpu usage by thread 'elasticsearch[data-warm004][management][T#5]'

I see this type of thread loading the CPU even during non-business hours - why?
What is the management thread pool doing?

Also, it looks like not all nodes receive search requests / they aren't loaded equally (the screenshot below shows CPU usage on the warm nodes while I was searching data for the last 60 days):

[screenshot 001: CPU usage on the warm nodes during the 60-day search]

Lumpy search load like this isn't terribly uncommon. It basically boils down to how the shards needed to satisfy a search request are distributed across the nodes, and which primary/replica copies are chosen to serve it.
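
If you want to see how the copies are actually laid out, the cat shards API lists which node holds each shard copy. A quick sketch (assuming localhost:9200 with no auth; the index pattern is just an example):

    import json
    import urllib.request
    from collections import Counter

    def shards_per_node(host="http://localhost:9200", index="logstash-*"):
        # Count assigned shard copies per node for the given index pattern.
        url = f"{host}/_cat/shards/{index}?format=json&h=index,shard,prirep,node"
        with urllib.request.urlopen(url) as resp:
            shards = json.load(resp)
        return Counter(s.get("node") for s in shards if s.get("node"))

    for node, count in sorted(shards_per_node().items()):
        print(f"{node}: {count} shard copies")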

So the management threadpool basically handles miscellaneous housekeeping tasks: NodeStats, ClusterStats, remote node management, clearing caches, persistent tasks, etc. A hodge-podge of different activities.
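
If you want to keep an eye on it, the cat thread_pool API can show just the management pool on each node, something like (again assuming localhost:9200 and no auth):

    import urllib.request

    # Show only the management pool, one row per node.
    url = ("http://localhost:9200/_cat/thread_pool/management"
           "?v&h=node_name,name,active,queue,rejected,completed")
    with urllib.request.urlopen(url) as resp:
        print(resp.read().decode("utf-8"))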

I'm not sure why your utilization stays around 10%. Is there any activity on the cluster (search, indexing, etc)? Or is it completely idle?

Are you calling cluster/node/index stats very frequently? It's a remarkably heavy API, since it needs to talk to all the nodes to collect the stats; if you call it multiple times per second, for example, it can have a noticeable impact.
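
For what it's worth, if something does need to poll stats, scoping the node stats call to just the sections you care about makes it noticeably cheaper than pulling everything. A rough sketch (the host and the chosen sections are just examples):

    import json
    import urllib.request

    # Ask only for the os and jvm sections instead of the full stats payload.
    url = "http://localhost:9200/_nodes/stats/os,jvm"
    with urllib.request.urlopen(url) as resp:
        stats = json.load(resp)

    for node in stats["nodes"].values():
        print(node["name"],
              node["os"]["cpu"]["percent"],
              node["jvm"]["mem"]["heap_used_percent"])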

Are there any pending tasks? Reindex operations?

Thank you for pointing this out to us.
The problem was a faulty monitoring script that queried all the nodes for a bunch of metrics every minute. After disabling this monitoring script, CPU load on the warm nodes dropped from the 8-10% range to near 0%, even at peak hours.
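
If we bring monitoring back, something much lighter should be enough, e.g. a low-frequency poll of the cat nodes API (just a sketch; the columns and the 5-minute interval are illustrative, not what our old script did):

    import time
    import urllib.request

    URL = ("http://localhost:9200/_cat/nodes"
           "?format=json&h=name,heap.percent,ram.percent,cpu,load_1m")

    while True:
        with urllib.request.urlopen(URL) as resp:
            print(resp.read().decode("utf-8"))
        time.sleep(300)  # poll every 5 minutes instead of every minute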

One more question - should the number of shards be tied to the number of nodes they're spread across?
I mean, there should be some kind of balance between shards and nodes, shouldn't there?
Currently we have 6 nodes and 10 shards per index; maybe we should increase to 12 shards per index to spread CPU, I/O, and heap usage more evenly?

Ah cool, glad we found it! The stats thing is definitely something that folks don't think about, since it feels like stats should be lightweight to collect. But the distributed nature, plus some internal bookkeeping, makes it reasonably expensive.

Maybe, maybe not. Hard to tell. An evenly divisible number of shards can help them spread more evenly across the cluster... but there's no guarantee that ES will allocate shards evenly. E.g. if there's a hiccup and shards need to re-arrange, you may end up with a "lumpy" distribution again. Or ES may just decide to allocate several primaries from one index to a single node, due to how the various factors conspire.

I wouldn't worry about it too much. A bigger concern is the total number of shards. 53 indices at 10 shards each, in just a 6-node cluster, is a fair amount of sharding. Each shard has a certain amount of overhead for the node holding the shard (memory, etc) as well as the master (which has to coordinate the cluster state). You might look into reducing the number of shards, depending on how large they are.
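
To illustrate one direction (not a prescription for this exact cluster): since these look like daily logstash indices, an index template can bring the primary count down for newly created indices, and existing indices can then be shrunk with the _shrink API to a factor of their current shard count. A sketch assuming localhost:9200 and no auth; the template name and shard numbers are made up:

    import json
    import urllib.request

    # 5.x index templates still use the "template" field for the pattern.
    template = {
        "template": "logstash-*",
        "order": 1,
        "settings": {
            "index.number_of_shards": 3,   # illustrative, not a recommendation
            "index.number_of_replicas": 1
        }
    }

    req = urllib.request.Request(
        "http://localhost:9200/_template/logstash-fewer-shards",
        data=json.dumps(template).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))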
