If you have 15 hot shards and you know which ones they are, you might benefit from monitoring their distribution and rerouting them via a script when required, to ensure they stay spread out (a rough sketch of such a script is below). It is not elegant, but it might reduce overload on individual nodes.
This only means that if you happened to have 3 hot shards on the same node, your IO might be more efficient, and you'll see more raw IOPS on that node as the node can use different threads.
But the main issue will still boil down to the skewed routing keys, which means only 15 threads (across the whole cluster) are doing most of the IO. It's this limited concurrency that creates the bottleneck; the iotune settings probably just bound that bottleneck further. If you are able to remove or increase those limits, I hope you will see some, maybe substantial, improvement. The deduping helps, as lots of updates is pretty much the worst-case scenario here, so it's important to validate (as per @Christian_Dahlqvist) that it's doing what you think it's doing.
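For the monitoring + reroute idea, a rough sketch of what such a script could look like, purely as an illustration (the index name, threshold and the idea of keeping a "hot index" list by hand are assumptions, and nodes holding zero hot primaries won't show up in the counts):

```python
# Rough sketch only: spot hot-shard pile-ups and print a suggested reroute.
from collections import Counter
import requests

ES = "http://localhost:9200"
HOT_INDICES = ["listings"]  # placeholder: indices whose shards you consider "hot"

def hot_primary_locations():
    """Return (index, shard, node) for started primaries of the hot indices."""
    resp = requests.get(
        f"{ES}/_cat/shards/{','.join(HOT_INDICES)}",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
    )
    resp.raise_for_status()
    return [
        (s["index"], int(s["shard"]), s["node"])
        for s in resp.json()
        if s["prirep"] == "p" and s["state"] == "STARTED"
    ]

shards = hot_primary_locations()
per_node = Counter(node for _, _, node in shards)
print("Hot primaries per node:", dict(per_node))

# If one node holds noticeably more hot primaries than another, print a
# _cluster/reroute "move" body to review rather than applying it blindly.
busiest, busiest_count = per_node.most_common(1)[0]
quietest = min(per_node, key=per_node.get)
if busiest_count - per_node[quietest] > 1:
    index, shard, _ = next(s for s in shards if s[2] == busiest)
    print("Suggested POST _cluster/reroute body:", {
        "commands": [{"move": {
            "index": index, "shard": shard,
            "from_node": busiest, "to_node": quietest,
        }}]
    })
```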
Yeah, I've not been able to generate the load and verify that using the commands specified by Christian; however, there are other metrics from which I can say with 99% confidence that the de-dup layer is working fine. I've compared the number of writes that used to happen before the de-dup layer and after: there is a 90% reduction in the number of writes, and the number of high-CPU instances has also dropped from 8-10 times daily to at most 1-2 occurrences a day.
However, most likely, by tomorrow I should have the load test ready along with the iostat results.
If the current deduplication solution reduced the write/update load by 90%, that is indeed a great reduction and a sign that it is working, at least to some extent. It does not necessarily prove that it is working as you specified, though. If there were a mistake in the logic or implementation and a correct implementation could have reduced the writes by 95% (instead of 90%), that would have halved the number of writes you are experiencing now, so it is in my opinion worth verifying.
Damn, that's a very fair point. As an additional step, I'll also revisit our de-duplication logic to see if we can fine-tune that layer as well… I didn't realise that every additional % we get beyond 90 would have such a major impact.
If you have a hot document that receives 10 updates per second that are sent to the deduplication layer, a 90% reduction still results in one update being processed per second. If instead the logic worked and only one write were generated per 30 seconds, that would be a reduction of 99.667% (300 updates reduced to 1). The thing is, I believe you only need one hot document like this per shard to cause the problem, as that is enough to trigger new small segments being created, so even if the logic worked for most documents it could still have an impact.
You are right, and this is how our logic works. When I said 90% for the 30-second window, I meant reducing the total number of writes (cold + hot) by 90%. What this essentially means is that the total writes per document are at most 1 in 30 seconds, and since cold documents don't receive multiple duplicate updates, their % reduction does not play a major role here. It's the hot documents that are contributing to this 90% goodness factor, so we can say that 99.667% is what we get for the hot data.
If I have to explain the de-dup layer in the simplest terms, it's just: "At most 1 update per document can pass through this layer per 30 seconds, even if we receive 500 updates in 30 seconds for that document."
@RainTown / @Christian_Dahlqvist I was thinking of creating 2 VMs, one with a 320 GB SSD and another with a 640 GB SSD. Given that we get 100 IOPS per GB, I'll try to simulate load on both of them and, using fio (Flexible I/O tester), see if we are hitting the IOPS limit. What do you guys think? (If the same load passes through the 640 GB machine with a smaller CPU % spike, then we can say that it's an IOPS issue and we need faster disks?)
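Roughly what I have in mind for the comparison, just to be concrete (the file path, job size, runtime and the expected ceilings below are placeholders, not a tuned benchmark):

```python
# Rough sketch: run the same fio job on each VM and compare achieved write IOPS.
import json
import subprocess

def run_fio(test_file: str) -> float:
    """Run a small random-write fio job and return the achieved write IOPS."""
    cmd = [
        "fio", "--name=iops-check",
        f"--filename={test_file}",
        "--rw=randwrite", "--bs=4k", "--iodepth=32",
        "--ioengine=libaio", "--direct=1",
        "--size=4G", "--runtime=60", "--time_based",
        "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, check=True, text=True).stdout
    return json.loads(out)["jobs"][0]["write"]["iops"]

# Run this separately on the 320 GB and the 640 GB VM: if the 100 IOPS/GB cap
# is really what bites, the 640 GB VM should top out roughly twice as high
# (~64k vs ~32k) under the same job.
print("write IOPS:", run_fio("/data/fio-testfile"))
```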
I would try to verify that the deduplication logic is working as intended first. It may have a seemingly low probability of being the cause of the issue, but if it did turn out to be incorrect, fixing it could have an order of magnitude larger gain than perhaps doubling I/O performance.
I am not saying I believe it is an issue, but I prefer to tackle low-hanging fruit with potentially large impact first.
Feel free to explain it in a more complex way!
The important thing is that it's accurate.
My simple understanding says: if you change that 30s window to a 60s window, you'd go from 2 writes/60s to 1 write/60s for "often updated" documents. If that's almost all the updates, then that's a decent win.
But actually I thought you said you were doing the writes every second, with a sliding 30s "lookback window", or similar? If so, a hot doc that gets an update every single second would still get updated every single second, but with refresh set to 30s a lot of those updates are never even seen by a search.
If I remember correctly there was a change a number of major versions ago where the write path was simplified. I recall this leading to a refresh being performed whenever a document in the transaction log is updated, which is why frequent updates are expensive. This overrides the refresh interval.
If this is no longer the case or if I recall incorrectly I am sure someone will correct me.
The coarse model, a certain number of IOPS per GB, doesn't really fit your use case.
It would, I guess, be interesting to see fio numbers for the "320GB" and "640GB" cases, to see if that kvm/qemu setup section actually "works as designed". But, IMHO, the easier test is to just remove that section entirely and test that. Try driving with the handbrake off.
Definitely, I have that on my checklist. As soon as I'm able to successfully generate the load, this is one of the things I'll make sure to verify.
Okay, I see there is a lot of confusion around how the de-dup layer is implemented; I guess I'm not explaining it well enough. Let me re-explain the whole dedup layer, this time in full depth.
This is the design - here listingId is our _id in Elasticsearch as well as the key in Redis on which de-dup happens. (Note that the design doc says 60 seconds, but we went ahead with 30 seconds because we cannot breach the SLAs.)
- Let's assume 3 listingIds, LST_A, LST_B, LST_C, for which we are getting an update request every 1 second. Let's say the initial requests were received at time T1.
- The polling time for these entries will be set to T1 + 30 seconds. Any updates coming for these listings between T1 and T1 + 30 will not cause an update in Redis.
- At T1 + 30 seconds, the spout will pick the entries from Redis Queue 1 and put them into the processing Redis Queue 2. After that, any updates coming for LST_A, LST_B, LST_C will be given a polling time of 30 seconds from then, and this process continues.
- So in a nutshell, every 30 seconds we can have at most 1 update per listingId (_id in Elasticsearch). A rough sketch of this in code follows below.
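To make the mechanics concrete, here is a minimal sketch of the scheduling part only, assuming a Redis sorted set for Queue 1 and a plain list for Queue 2 (the key names are made up, and payload handling plus the actual Elasticsearch write are deliberately left out):

```python
# Minimal sketch of the dedup scheduling: at most one pass-through per
# listingId per 30-second window. Key names are illustrative only.
import time
import redis

r = redis.Redis()
WINDOW = 30  # seconds

def on_update(listing_id: str) -> None:
    """Called for every incoming update. Only the FIRST update in a window
    schedules a polling time; later updates within the window change nothing."""
    poll_at = time.time() + WINDOW
    # ZADD ... NX: adds only if the listing is not already scheduled (Queue 1).
    r.zadd("dedup:queue1", {listing_id: poll_at}, nx=True)

def spout_tick() -> None:
    """Move every listing whose polling time has passed from Queue 1 to the
    processing queue (Queue 2); downstream workers consume Queue 2 and issue
    at most one Elasticsearch update per listing per window."""
    now = time.time()
    for listing_id in r.zrangebyscore("dedup:queue1", 0, now):
        r.rpush("dedup:queue2", listing_id)
        r.zrem("dedup:queue1", listing_id)
        # The next update for this listing goes through on_update() again and
        # gets a fresh polling time of now + 30s, so the cycle continues.
```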
Since you mentioned versions, just to point out, we are currently on ES 7.17… do you reckon there is any goodness to be had if we go to version 8.x or 9.x?
I don't think we can remove the 100 IOPS/GB configuration. (Hope you are talking about this config, or is it something else?)
I am not aware of any changes that would have an impact on this scenario, but would still recommend you upgrade to a supported version before you fall too far behind.
This forum is for elasticsearch issues. That aspect of your problem, or rather of troubleshooting your problem, is not an elasticsearch issue; it's a company policy issue.
Request a VM with the maximum-size (3.8TB?) virtual disk. That way you get the maximum "IOPS" that the specific rule allows. In fact, better still would be to use the SR-IOV tech the drive expressly supports, but that option is also not open to you. Those are organizational issues.
The test with a massive drive, hence max IOPS, is useful. You might see fantastic numbers with fio. But remember you likely have a single hot IO thread per hot shard. I looked for ways within kvm/qemu to confirm/demonstrate this inside the VM, but could not find anything definitive. I am presuming you don't have access to the host.
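The closest thing I could suggest from inside the guest is plain per-thread IO accounting from /proc; a rough sketch (assuming task IO accounting is enabled and you can read the Elasticsearch process's /proc entries, e.g. as root) that would at least show whether a handful of write threads dominate:

```python
# Rough sketch: list a process's busiest threads by cumulative bytes written,
# using /proc/<pid>/task/<tid>/io and /proc/<pid>/task/<tid>/comm.
import os
import sys

def thread_write_bytes(pid: int):
    results = []
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            with open(f"/proc/{pid}/task/{tid}/comm") as f:
                name = f.read().strip()
            with open(f"/proc/{pid}/task/{tid}/io") as f:
                io = dict(line.split(": ") for line in f.read().splitlines())
            results.append((int(io["write_bytes"]), name, tid))
        except (FileNotFoundError, PermissionError):
            continue  # thread exited, or we lack permission
    return sorted(results, reverse=True)

# Usage: python3 thread_io.py <elasticsearch-pid>
for written, name, tid in thread_write_bytes(int(sys.argv[1]))[:10]:
    print(f"{written:>15}  {name}  (tid {tid})")
```

These counters are cumulative, so taking two snapshots a minute apart and diffing them says more than a single read.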
EDIT1: when poking around trying to understand, I came across a few posts saying to use virtio-scsi, not virtio-blk, though without clear consensus. The "why" boiled down to virtio-blk == "here's a disk, send it requests" vs virtio-scsi == "here's a controller, send it commands".
EDIT2: Another test you could do is actually reduce the vCPUs in a VM. Going 20 -> 32 made no significant difference, right? I expect 20 -> 10 might make little real-world performance difference either.
You've read my mind, aha. This was one of the steps I wrote down: to check whether we actually require 20 cores or are just wasting 10 of them.
You're suggesting the same, right? Going to 10 cores won't give performance benefits, but it will reduce the number of cores we might be wasting, correct? Or am I getting it wrong, and moving to 10 cores can somehow improve performance?
