Hey, good morning.
Why so? If the underlying shared resource is choking, then does it really matter whether the hot node is sending 10K IOPS or the cold node is sending 100 IOPS? Shouldn’t both of them experience the bottleneck?
Yes, but to a much lesser extent. But it’s not shared storage, so ... the point is moot, no?
You’re right, it should display NVMe, right? (Thereby making use of SR-IOV, which is mentioned in the chip documentation I shared.) Let me confirm this with the VM team.
Are we sure that the storage is not shared? I believe you asked me to confirm with our internal VM team whether the resource is ultimately shared or individual, or am I missing something here?
This thread is getting a bit like a WhatsApp group chat, which I'm not sure is ideal.
You shared the specific disk spec, the Samsung model. It looks like a fine, high-performance disk from that spec. I see now it comes in different sizes, "2.5-inch form factor with capacities ranging from 1.92TB to 15.36TB". Your partition, vdb, is 478GB, so either they are wasting the rest of the disk or it is somehow shared, in some respect.
So yes, we still need that detail of exactly how the storage is sliced / diced / accessed by the data node VMs. You may as well share the server model / spec too.
Hi, sure, I’ve raised a mail to the VM team to get further details… Since it’s a holiday tomorrow, I’ll most likely be able to get back with a reply after a day or so… Will keep you posted as soon as I get the storage details.
The storage you are using looks to have good performance. You do, however, have a layer of virtualization between Elasticsearch and the disk, and if this is not correctly or optimally configured you may not be getting the near-native performance you are expecting.
What is the filesystem you are using when mounting the disk inside the VMs?
xfs or ext4?
Is it possible to check this if I have the IP of the VM and access to run commands on it (if yes, can you please share some commands)? Or do I need to get it checked by the VM team?
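For what it's worth, `df -T` or `lsblk -f` run inside the VM will show the filesystem type directly. The same information as a minimal Python sketch (assuming Python 3 is available on the VM; it just reads /proc/mounts):

```python
# Minimal sketch: print the filesystem type of each mounted block device
# by reading /proc/mounts (run inside the VM; assumes a Linux guest).
with open("/proc/mounts") as mounts:
    for line in mounts:
        device, mountpoint, fstype, *_ = line.split()
        if device.startswith("/dev/"):
            print(f"{device:15} {mountpoint:25} {fstype}")
```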
Yes, let me get back with the details around those by tomorrow or the day after… Meanwhile, just out of curiosity: considering I have a 60-data-node cluster with 340 GB SSD / 20 cores / 53 GiB memory per node, what indexing rate can I reach without throttling the CPU?
Also, @Christian_Dahlqvist / @RainTown, if you could please help me with these queries as well; these were part of the optimizations we did, but we didn’t see any great outcomes.
I cannot say, as it depends on a lot of factors, and your use case is not the typical scenario most indexing benchmarks I am aware of use (ingestion of immutable log or metrics documents into time-based indices). In my opinion this is a use case where benchmarking is always required in order to find out what is possible and what the limiting factors are. Note that:
- Updating documents is more expensive than indexing new immutable documents where Elasticsearch assigns the document ID automatically, as each insert or update by explicit ID requires a search on the shard for the potentially existing document before the write can take place (see the sketch after this list).
- The mappings used can play a big part. If you use nested documents/mappings or parent-child, extra work needs to be done.
- Frequent updates to individual documents can cause a lot of extra load in terms of merging. You outlined that you have addressed this, but I cannot tell whether that solution is 100% accurate/efficient or not.
- Document size and complexity have a big impact on indexing throughput, as more work needs to be done. Complex mappings, e.g. with a lot of analysers, can also require a lot of work and slow down indexing.
- If you are using replicas, writes need to be replicated to these across the network before the write completes. Network performance can therefore become a bottleneck that is sometimes overlooked.
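To make the first point above concrete, here is a minimal sketch (Python with the elasticsearch-py bulk helpers; the index names, fields and local cluster URL are placeholders) contrasting append-only bulk indexing with auto-generated IDs against update actions that target existing IDs and therefore force a lookup on the shard first:

```python
# Illustrative only: contrasting append-only indexing with update-heavy ingest.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local test cluster

# Append-only ingest: no _id supplied, Elasticsearch generates one,
# so no per-document lookup is needed before the write.
index_actions = (
    {"_op_type": "index", "_index": "logs-example", "message": f"event {i}"}
    for i in range(1000)
)
helpers.bulk(es, index_actions)

# Update-heavy ingest: every action targets an existing _id, so each write
# first has to locate the current version of that document on the shard.
update_actions = (
    {
        "_op_type": "update",
        "_index": "docs-example",
        "_id": str(i),
        "doc": {"counter": i},
        "doc_as_upsert": True,  # create the doc if it does not exist yet
    }
    for i in range(1000)
)
helpers.bulk(es, update_actions)
```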
Your issue seems related to merging and indexing, which coordinating-only nodes are not involved in, so I do not understand the question. Coordinating-only nodes in my experience primarily help with queries and aggregations, where they can take some load off the data nodes.
I can’t help with this as I do not know the internals here. It is possible that your key structure and how it hashes could affect how data is distributed and cause an uneven load and hotspots. It could also be how the routing partition size corresponds to the number of primary shards. If you change the shard partition size to an odd number, e.g. 9, do you get the same result?
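For reference, the setting being referred to is `index.routing_partition_size`, which is fixed at index creation time alongside the number of primary shards. A hedged sketch of what experimenting with it could look like (elasticsearch-py 8.x; the index name, shard counts and field are illustrative, not a recommendation):

```python
# Illustrative index creation showing where routing_partition_size lives.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local test cluster

es.indices.create(
    index="routed-example",
    settings={
        "index": {
            "number_of_shards": 9,        # e.g. an odd primary shard count
            "routing_partition_size": 3,  # must be > 1 and < number_of_shards
        }
    },
    mappings={
        # routing_partition_size requires _routing to be mandatory on writes
        "_routing": {"required": True},
        "properties": {"customer_id": {"type": "keyword"}},
    },
)

# Writes then spread a single routing value over a partition of shards
# instead of pinning it to exactly one shard:
es.index(
    index="routed-example",
    id="doc-1",
    routing="hot-customer",
    document={"customer_id": "hot-customer"},
)
```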
We have not been able to solve the CPU spike caused by segment merging. Additionally, one more point: I’ve noticed that sometimes we don’t have a high indexing rate, but our number of segments increases and segment merging kicks in, which causes the CPU spike. Based on all the graphs I’ve analysed so far, the CPU spike corresponds to the segment-merge graph and not to the high-indexing-rate graph (although a high indexing rate in turn leads to segment merging).
I believe frequent updates to individual documents can result in a lot of small segments being generated, which will lead to increased merging. If you do indeed see a lot of small segments, I would examine the indexing process and deduplication logic you outlined to verify that your implementation is correct and that you do not have a lot of concurrent updates to individual documents (within single bulk requests or even across parallel bulk requests).
We do have a lot of concurrent update requests; in fact, 99% of our requests are updates. The de-duplication logic is there to reduce the number of updates, but even after that, all the requests are still update requests.
Is the deduplication logic global per document or just covering a small subset of the ingest process, e.g. a single batch or a single client?
If you can have lots of updates to the same document(s) arrive in parallel from different places I would expect increased merging pressure and overhead as you are seeing.
If you can describe the ingest and deduplication process in more detail, maybe someone can come up with tweaks or alternatives?
If I were asked to handle a large number of updates that needed to be deduplicated for efficiency, I would probably start experimenting with something like this:
- Set up a Kafka cluster with a single topic for all updates. This would have a number of partitions sufficiently large to support the ingest rate. Updates would be routed to partitions based on the document ID, ensuring all updates related to a single document end up in the same partition while documents related to a hot Elasticsearch routing ID are spread out.
- Each partition would have a single consumer that reads a large batch from Kafka, deduplicates the batch and then indexes the data into Elasticsearch. This would ensure no concurrent updates are performed for single documents. It could still result in rather frequent updates for hot documents, though, which is why using large bulk requests is likely beneficial (it increases deduplication efficiency). A rough sketch of such a consumer loop follows below this list.
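Here is what such a per-partition consumer loop could look like (Python with kafka-python and elasticsearch-py; the topic name, index name, batch sizing and payload format are all assumptions, not a definitive implementation):

```python
# Sketch: each consumer in the group reads a large batch, dedupes by document
# ID (last update wins) and sends a single bulk request to Elasticsearch.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "es-updates",                       # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="es-update-indexers",
    enable_auto_commit=False,
    key_deserializer=lambda k: k.decode("utf-8"),
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

while True:
    # Pull a large batch; tune timeout/max_records to the ingest rate.
    batch = consumer.poll(timeout_ms=5000, max_records=10000)
    latest = {}
    for records in batch.values():
        for record in records:
            latest[record.key] = record.value  # later updates overwrite earlier ones

    if latest:
        helpers.bulk(
            es,
            (
                {
                    "_op_type": "update",
                    "_index": "docs-example",  # assumed index name
                    "_id": doc_id,
                    "doc": payload,
                    "doc_as_upsert": True,
                }
                for doc_id, payload in latest.items()
            ),
        )
        consumer.commit()  # commit offsets only after the bulk request succeeds
```

On the producer side, keying each message by the document ID lets Kafka's default partitioner keep all updates for a given document in the same partition.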
So we have only one source that calls the update, and that is where we have implemented the de-duping logic plus the batch update. So overall, let’s say hypothetically we get 10K requests per second, i.e. about 300K requests per 30 seconds, and out of those 300K requests only 15% have unique document IDs; then only that 15% of requests will be applied as batched updates on our Elasticsearch. Does that help clarify?
No, I am still not sure I understand how X original requests translate into Y document updates through the bulk API.
If you are saying that with the current logic each unique document will only receive a single update request per 30 seconds, I would not expect a lot of small segments to be generated.
We are using a key-value Redis data structure to implement the deduping. The key is the same as the document ID in Elasticsearch; incoming requests are queued for 30 seconds so that requests on the same key overwrite each other in the map, and after 30 seconds only one update per ID is sent.
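In rough terms the flow looks like this (a simplified sketch of what was just described; the hash name, payload shape and index name are illustrative):

```python
# Simplified sketch of the dedup flow described above (illustrative names).
import json
import time
import redis
from elasticsearch import Elasticsearch, helpers

r = redis.Redis()
es = Elasticsearch("http://localhost:9200")

def queue_update(doc_id: str, payload: dict) -> None:
    # Later updates to the same document ID simply overwrite the earlier value.
    r.hset("pending_updates", doc_id, json.dumps(payload))

def flush_every_30s() -> None:
    while True:
        time.sleep(30)
        pending = r.hgetall("pending_updates")
        if not pending:
            continue
        # Note: a real implementation would rotate the hash atomically so
        # updates arriving during the flush are not lost.
        r.delete("pending_updates")
        helpers.bulk(
            es,
            (
                {
                    "_op_type": "update",
                    "_index": "docs-example",  # illustrative index name
                    "_id": doc_id.decode(),
                    "doc": json.loads(payload),
                    "doc_as_upsert": True,
                }
                for doc_id, payload in pending.items()
            ),
        )
```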
True, but we have 1 billion documents.