High CPU Usage on a few data nodes / Hotspotting of data

Do you recommend switching to XFS instead?

1 Like

I don't think this will make any difference. Besides that, to make this change you would need to reinstall and reconfigure your nodes.

Being honest, I'm not sure you will be able to solve this issue, as it seems to be a problem with the design choices: you mentioned that you have been having this problem for the past 4 years and that it is getting worse.

Not sure if it was asked already, but what is the refresh_interval on your indices? The default is 1s; increasing this value can help indexing performance and reduce CPU usage, but it also delays how quickly new documents become visible to search, so you would need to test.
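
For reference, refresh_interval is a dynamic setting, so it can be changed (and reverted) on a live index without a restart; a minimal sketch, with the index name as a placeholder:

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}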

Also, I could not find it, but how many indices do you have and what is the total size of your cluster? You mentioned something around 60 data nodes with 320 GB of disk each, which seems a weird size: that is roughly 19 TB raw, and after the disk watermarks you would have something close to 16 TB of usable space, which seems a small amount to split across 60 nodes. I had an on-premises cluster with a hot tier of 16 TB split across only 4 nodes.
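
A quick way to check both, and to see whether data or disk usage is skewed toward a few nodes, is the cat APIs, for example:

GET _cat/indices?v&s=store.size:desc
GET _cat/allocation?v&s=disk.percent:desc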

I think you will need to review your cluster design to be able to fix this problem.

1 Like

30 seconds.

We currently have a 63 data node cluster (the remaining nodes are 3 masters and 5 coordinators). These 63 data nodes have 340/440 GB of storage attached to each of them. Can you suggest a better design? Also, we only use 3 TB out of the 16 TB of storage.

Wasn’t able to generate enough load today :confused: Will get back on this tomorrow.

1 Like

Dropping replicas from 2 down to 1 should help.

From my testing in our production cluster, more than 1 replica doesn't seem to speed up reads noticeably, but it does incur a heavy CPU cost during writes, especially when all of your writes are updates.
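
If you do want to test it, number_of_replicas is also a dynamic setting, so it is easy to drop and restore; a sketch, with the index name as a placeholder:

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}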

2 Likes

Originally (5 days ago) you mentioned around 60 data nodes. So 3 new data nodes were added?

Anyway, you keep posting these things as if they were real hardware. You have virtual machines. We (at least I) don't know yet:

  • How the storage is "sliced and diced" before being presented to the VMs, and what are you sharing it with?
  • How much storage there is - you quoted a colleague saying you had 2x 3.8TB PM1733 per server which "are shared among the VMs scheduled on that server". Well, that's almost 8TB per server.
  • We "only use 3TB out of the 16TB storage". Is that, say, 2 servers? Or are there 20 servers, or 50, or 250, or ...?
  • Is there any over-provisioning, of anything, going on here?
  • What exactly is the Virtual layer? Some VMware product? Something else?

I also note:

p4i isn't an instance type I recognize, so I'm guessing this is some kind of corporate cloud solution copying (a bit) from AWS-style terminology. Maths tells me that

100 IOPS / GiB

would mean

32,000 IOPS for a 320 GiB volume (and we have no evidence you are seeing anything remotely close to that)

on the assumption of a linear mapping, but it would also translate to

380,000 IOPS for a 3.8 TB disk

and I'm a little skeptical of that.

1 Like

It has been 63 since the start :sweat_smile:

1 Like

Will come back on these tomorrow. I have very little knowledge about this and will need to discuss internally.

Thanks for that suggestion, but we can't go with that :confused: To make our system fault tolerant, we need at least 2 replica shards.

That's what we thought too, but we eventually reduced all but a few indices down to 1 replica. Unless you are expecting multiple nodes to be down simultaneously, ES is pretty good with shard assignment.

Granted, you have custom routing. I assume you let the replicas be assigned automatically?

Another potential solution is to increase the shard count of your heavy indices. Spreading the CPU load across more nodes will reduce hotspotting.
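
Since number_of_shards cannot be changed on an existing index, that means creating a new index with a higher primary shard count and reindexing into it; a minimal sketch (the index name and shard count are placeholders):

PUT my-heavy-index-v2
{
  "settings": {
    "index": {
      "number_of_shards": 10
    }
  }
}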

You will need to reindex for your use case. I assume you guys don't create new indices periodically, since your workload is update-heavy.

What we have done is create a monthly index, say "user_202512". The index contains user info, is constantly being updated, etc.

When a user's data is updated, we write the data to 2 indices: "user_202512" & "user_202601".

We simply write the current user data to both this month's and next month's index.

The reader simply reads "user_currentMonth" to get the latest data.

That way, our "user" index rotates monthly, which allows changes to settings and/or mappings. Periodically creating a new index in ES is beneficial: it makes stale data (from a misbehaving application) very easy to spot.
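
A sketch of that dual-write pattern, using the user_YYYYMM naming above (the document body and id are made up):

PUT user_202512/_doc/42
{
  "user_id": 42,
  "plan": "premium"
}

PUT user_202601/_doc/42
{
  "user_id": 42,
  "plan": "premium"
}

GET user_202512/_doc/42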

1 Like

Correct.

We cannot create a new index because our use case involves updates to data that can be 3 years old as well.

Also, we tried re-indexing after increasing the routing_partition_size from 5 to 8, but after re-indexing, all the data for a single routing key went to 1 shard (instead of being spread across 8 shards).

We used the command below. Are you aware of why this could have happened?

POST _reindex?slices=50&requests_per_second=-1&wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "index": "listings"
  },
  "dest": {
    "index": "listings_v2",
    "routing": "keep"
  }
}
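
For reference (not necessarily the cause here): index.routing_partition_size is a static setting, so it only takes effect if it is defined in the destination index's settings when that index is created, and such indices also require the _routing field to be marked as required in the mappings. A sketch of creating the destination before running the reindex above (the shard count is an assumption):

PUT listings_v2
{
  "settings": {
    "index": {
      "number_of_shards": 40,
      "routing_partition_size": 8
    }
  },
  "mappings": {
    "_routing": {
      "required": true
    }
  }
}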

In the next post I'll make sure to consolidate and post all of the below:

  • open questions around the internals of our VMs, etc.
  • iostat metrics along with ES graphs during periods of high CPU spikes
  • cluster configuration

Since collecting this data can take 1-2 days, I'll post all the details in a single shot; maybe then we will be able to find a solution :smiley:

1 Like

Hi @RainTown @Christian_Dahlqvist, here are some of the new points:

  1. storage is sliced using LVM and presented to the VMs using virtio-blk

  2. virtualization layer - QEMU/KVM

  3. the infra team has configured 100 IOPS per GB at a block size of 4 KB (the 100 IOPS per GB is committed only for a 4 KB block size)

  4. I checked internally within my team and we don't seem to have a noisy neighbour problem

  5. cluster configuration

    1. 3 master nodes - each having 4 cores and 24 GiB memory
    2. 5 coordinator nodes - each having 4 cores and 53 GiB memory
    3. 62 data nodes - 44 nodes have 320 GB storage / 20 cores / 53 GiB memory each; the remaining 18 nodes have 446 GB storage / 20 cores / 53 GiB memory each
  6. this is how our disk is configured:

     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
       <source dev='/home/pcissd/vm-053af617-0264-4cd2-52e8-3b32935676d0-PCI-SSD-P3-disk1'/>
       <target dev='vdb' bus='virtio'/>
       <iotune>
         <read_iops_sec>32000</read_iops_sec>
         <write_iops_sec>32000</write_iops_sec>
       </iotune>
       <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
     </disk>

I’m generating load meanwhile to run the iostat commands, will share the results soon.

So this is not an Elasticsearch question now; it's just IO tuning in a KVM environment.

Take that on one of your hot nodes, for a test, and see what you get.

But I am guessing you need to enable multi-queue: add queues='4' (or some bigger number) to the driver line in the config, after the discard=... attribute. I believe that, if unset, the default is 1.
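
For example, the driver line from the config above would become something like this (4 queues is just an illustration, not a recommendation):

       <driver name='qemu' type='raw' cache='none' io='native' discard='unmap' queues='4'/>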

On your VM, see if there are dirs under:

ls /sys/block/vdb/mq/

If you have just a directory called 0 then you have 1 queue.

e.g. for me

[root@rhel10x1 ~]# ls /sys/block/vda/mq/
0  1  2  3

The queue count should not exceed processor count.

1 Like

ls /sys/block/vdb/mq/

Got the below output:

0  1  10  11  12  13  14  15  16  17  18  19  2  3  4  5  6  7  8  9

Number of cores = 20, should be good then?

We have one hypothesis right now; just thinking out loud.

Let's debug the highlighted node. From the above graph, we can see that the disk throughput is reaching a max of 385 MB/s = 394,240 KB/s.

From this graph, we can see that the disk IOPS is 3.21K.

Hence, block size = total disk throughput / disk IOPS = 394,240 / 3,210 ≈ 122.8 KB.

Our infra team has committed write IOPS of 100 per GB only for a block size of 4 KB. However, our block size seems to be reaching ~122 KB. Does that mean we are not actually getting 100 IOPS per GB and are hitting a wall?
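
One way to sanity-check that block size directly, rather than deriving it from the graphs, is iostat's extended device stats on the hot node while the load is running; a sketch:

# Extended per-device stats every 5 seconds; look at the vdb row.
# Depending on the sysstat version, the average request size is reported as
# rareq-sz / wareq-sz (in KB) or avgrq-sz (in 512-byte sectors).
iostat -xd 5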

Did we get anywhere on this?

Not yet :confused:

Yes, it's a decent theory.

And I think I took us down a wrong alley. The set of 20 queues won't help much here, as we have only one (hot) shard per node, right? So the available parallelism doesn't help much, as each shard has one writer thread.

Not necessarily. In total we have 5 hot primary shards, meaning 15 hot shards in total (5 primaries and 10 of their replicas). Considering our cluster configuration, every node usually has 3 shards, so in the worst case all 15 hot shards (primary + replica) can end up concentrated on just 5 data nodes.

So, in the worst case, a data node can have 3 hot shards.
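
To see how those hot shards are actually distributed across nodes, the cat shards API can be sorted by node; a sketch using the listings index name from earlier in the thread (the column list is just a convenient subset):

GET _cat/shards/listings?v&h=index,shard,prirep,store,node&s=node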