We are using 63 data nodes, 3 master nodes, and 5 coordinator nodes… sharing the machine specs for each category:
- data node: 320 GB SSD, 20 cores, 53 GB memory per node
- coordinator node: 4 cores, 53 GB memory per node
- master node: 4 cores, 24 GB memory per node
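A quick way to cross-check the node roles and per-node resources is the cat nodes API; a minimal sketch, assuming the cluster is reachable on localhost:9200 (a placeholder):

```bash
# List each node's roles plus CPU, heap, RAM and disk usage.
# localhost:9200 is a placeholder for any reachable node in the cluster.
curl -s "http://localhost:9200/_cat/nodes?v&h=name,node.role,cpu,heap.percent,ram.percent,disk.used_percent"
```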
Odd sizes. Are these VMs?
Is the SSD local or accessed via network?
Since you wrote that, I presume we are talking here about either some VMs, or cloud instances, or similar, right?
[ Life was so much simpler when people could say "it's a Dell PowerServe 220, or an HP ProLiant G36, ...". ]
That 320 GB SSD, is it locally attached (in the very real sense) to the host and passed through to the VM, or is it some AWS "EBS storage type", or ...? Or is it maybe "320 GB from a large pool of SSD storage provisioned by the VMware team to the VMs", like happens in a typical corporate environment? The point isn't really to know the precise detail, but to get a sense.
And do you have access to run commands like iostat on the data nodes? If so, what do they show?
Btw, I am a details guy, and that hot threads output is not exactly recent data, is it?
Yes, these are VMs… we have the capability to configure machines with any specs.
For local or network, let me get back on that.
Yes, we are using VMs… let me get back to you with the exact details for the questions you've asked. As for the hot threads part, yeah, it's not recent, but it's the same even now; I just shared the log which I had copied and pasted into my tracker.
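Since the hot threads log being referenced is an old copy, a fresh snapshot can be taken while a spike is actually happening; a minimal sketch, with localhost:9200 as a placeholder endpoint:

```bash
# Capture a fresh hot threads dump from all nodes into a timestamped file.
curl -s "http://localhost:9200/_nodes/hot_threads?threads=5" > hot_threads_$(date +%Y%m%d_%H%M%S).txt
```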
As @RainTown pointed out, how you access your SSD storage is very important. On AWS there are several tiers of SSD-backed EBS, and the cheapest types do not provide anywhere near the performance you would see from a local SSD of good quality.
If you can, try to run iostat on the hot data nodes and check await, IOPS, and other I/O-related statistics.
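For example, something along these lines gives extended per-device statistics: r/s + w/s approximates IOPS, r_await/w_await are average request latencies in milliseconds, and %util shows how busy the device is. The device name and interval here are assumptions:

```bash
# Extended I/O stats for the data disk: 12 samples, 5 seconds apart.
# /dev/vdb is assumed to be the volume holding the Elasticsearch data path.
iostat -x -d /dev/vdb 5 12
```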
The 320 GB SSD is locally attached to the host running the Elasticsearch process, and yes, we have access to run the iostat command.
Do you want to see the iostat output during the periods of CPU spikes, or shall I provide you with the current point-in-time information?
Yes sure, I can do that… sharing the current point-in-time logs (currently CPU is around 30%).
Linux 5.10.0-33-cloud-amd64 12/12/25 _x86_64_ (20 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
11.49 0.14 1.14 0.08 0.00 87.13
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
vda 8.01 1.79 42.89 15.08 30514728 730631874 256923132
vdb 249.06 149.36 4320.12 124.47 2544293144 73589851768 2120169144
249 tps is not a lot, but we don't know if it was under any stress for that period.
Please run on the 15 "hot" nodes:
iostat -t -c -d /dev/vdb -x 10 360
This will take an hour. Hopefully it will include a period where the node is showing high (90%+) CPU. If it doesn't, then run it again for another hour. Repeat until you get an hour that includes such a period, and please share that hour's data for the hosts impacted.
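Since the capture runs for a full hour, one way (among others) to keep it alive after the SSH session closes is to background it and write to a per-host log file:

```bash
# Run the hour-long capture in the background, one log file per host and run.
nohup iostat -t -c -d /dev/vdb -x 10 360 > iostat_$(hostname)_$(date +%Y%m%d_%H%M).log 2>&1 &
```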
Sure, thanks a lot for the command, I’ll run it and share the results.
This is what we got at around a 60% CPU spike… I'll try to share one for 90% as well, but it seems like we didn't get enough writes today, haha.
12/12/25 19:04:30
avg-cpu: %user %nice %system %iowait %steal %idle
30.42 0.27 3.32 0.55 0.12 65.32
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
vdb 711.00 48579.20 0.00 0.00 0.23 68.33 533.30 7262.80 1280.80 70.60 0.14 13.62 0.00 0.00 0.00 0.00 0.00 0.00 246.40 0.08 0.26 54.52
12/12/25 19:04:40
avg-cpu: %user %nice %system %iowait %steal %idle
30.33 0.21 4.15 0.60 0.12 64.58
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
vdb 494.30 40285.60 0.00 0.00 0.24 81.50 832.10 7205.60 947.90 53.25 0.11 8.66 0.00 0.00 0.00 0.00 0.00 0.00 480.40 0.07 0.25 66.32
12/12/25 19:04:50
avg-cpu: %user %nice %system %iowait %steal %idle
25.46 0.00 3.21 0.84 0.14 70.35
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
vdb 370.80 32260.40 0.00 0.00 0.27 87.00 1736.60 42340.00 5236.80 75.10 0.13 24.38 0.00 0.00 0.00 0.00 0.00 0.00 869.30 0.08 0.39 65.16
12/12/25 19:05:00
avg-cpu: %user %nice %system %iowait %steal %idle
38.40 0.26 6.25 1.01 0.26 53.82
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
vdb 655.90 44006.00 0.00 0.00 0.36 67.09 1745.90 84306.80 14319.00 89.13 0.24 48.29 0.00 0.00 0.00 0.00 0.00 0.00 775.80 0.13 0.77 73.76
Let's see what the higher CPU load stats say, but already:
You say "around a 60% cpu spike" but the logs show CPU, and highest I see is 38%, measured over 10 seconds. How long was the spike to "60%"? A couple of seconds?
The last set shows %util at 73%. That means that for 73% of that 10-second interval, there was an I/O request being processed. r/s : w/s : f/s were 655.90 : 1745.90 : 775.80. That's respectable, but nothing remarkable, and it's already at 73% util.
What's the refresh_interval on the relevant index? Are you using refresh=true?
The refresh interval is 30 seconds.
If possible, could you please explain a bit about what these terminologies mean and how we can determine whether they can lead to a CPU spike or not? That will help us understand Elastic in more depth as well.
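Briefly: refresh_interval controls how often Elasticsearch makes newly indexed documents visible to search (each refresh writes a new Lucene segment, which costs CPU and I/O), while refresh=true on an index/bulk/update request forces an immediate refresh on every call, which can be expensive under heavy writes. A minimal sketch for checking and setting it, with my-index and localhost:9200 as placeholders:

```bash
# Check the current refresh_interval on the index.
curl -s "http://localhost:9200/my-index/_settings?filter_path=*.settings.index.refresh_interval"

# Set it explicitly; longer intervals mean fewer segment writes per second.
curl -s -X PUT "http://localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```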
Does this mean implementing bulk indexing requests as outlined here?
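For reference, a bulk request batches many index/update actions into a single HTTP call instead of one request per document; a minimal sketch, with my-index and the documents as placeholders:

```bash
# Two index actions in one _bulk call (NDJSON body, must end with a newline).
curl -s -X POST "http://localhost:9200/_bulk" \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @- <<'EOF'
{"index": {"_index": "my-index", "_id": "1"}}
{"field": "value-1"}
{"index": {"_index": "my-index", "_id": "2"}}
{"field": "value-2"}
EOF
```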
How are you batching up the data? Are you using a message queue or some other method?
Can you provide some more details about what you did in this layer and how it works?
Is it related to the comment I made about frequent updates to individual documents?
We are batching the data in 2 ways:
It's a Redis cache wherein we are using a key-value sort of data structure, so that the data for one key doesn't create another upsert request for another 30 seconds. This is because, in our case, whenever we experience a higher number of writes, 90% of the IDs are duplicates, so this key-value data structure works.
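A simplified sketch of that 30-second per-key suppression idea, assuming redis-cli is available; the key prefix and document ID are placeholders, not the actual implementation:

```bash
# Only forward an upsert if this document ID has not been written in the
# last 30 seconds. SET ... NX EX 30 succeeds only when the key is absent.
DOC_ID="example-doc-id"   # placeholder
if [ "$(redis-cli SET "seen:${DOC_ID}" 1 NX EX 30)" = "OK" ]; then
  echo "first write for ${DOC_ID} in this window -> send the upsert"
else
  echo "duplicate within 30s -> merge into the pending value instead"
fi
```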
@RainTown, as we discussed in the PM, can you please guide me on how this can be an IOPS issue? I don't have much knowledge about that; if you could, please suggest how to go in depth and figure out if it's actually an IOPS issue. And if yes, then how can we aim to solve it?
That bit is easy. Get faster disks.
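One way to put a concrete IOPS/latency number on the disk itself, independent of Elasticsearch, is a short fio run against a scratch file on the same volume; the path, size, and queue depth here are assumptions, and fio would need to be installed:

```bash
# 60-second random-write test on the data volume; compare the reported IOPS
# and latency with what iostat shows while Elasticsearch is under load.
fio --name=es-disk-check --filename=/path/on/vdb/fio-testfile \
    --rw=randwrite --bs=4k --size=2G --direct=1 \
    --ioengine=libaio --iodepth=32 --numjobs=1 \
    --runtime=60 --time_based --group_reporting
rm -f /path/on/vdb/fio-testfile
```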
OK. Then it sounds like you have addressed the potential issue around frequent updates and implemented bulk requests.
You mention that you have a high indexing load. What does the query load look like? How many concurrent queries do you need to support? What is your latency target/limit? How is the cluster performing from a query perspective now (for both hot and cold IDs)?
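One quick way to see whether search traffic is competing with indexing is the cat thread pool API, which shows active, queued, and rejected tasks per node for the search and write pools; localhost:9200 is a placeholder:

```bash
# Active/queued/rejected counts for the search and write thread pools.
curl -s "http://localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected"
```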