High CPU Usage on a few data nodes / Hotspotting of data

Updates to a large number of different documents do not necessarily result in a lot of small segments.

Are you doing this using a single Redis instance for 1 billion documents?

I assume you are writing to Elasticsearch concurrently using a number of clients. How do you manage the partitioning of the document ID space across these clients to avoid duplicate updates?

Are you only sending update requests every 30 seconds and idling in between, or is the processing staggered in some way?

If this works the way you describe, I do not think you should see a lot of small segments created, even for 1 billion total documents. Do you know the document ID of a document that is heavily updated? If so, you could retrieve it from Elasticsearch using the get document API once per minute and verify that the _version field increments at a nice slow pace, in line with the expected update frequency.

It is quite possible that frequent updates are not causing the issues you are describing at all, so it would be useful to rule that out if possible. You may also use the cat segments API to check the size of the newly generated segments.
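
Something along these lines should do; the host, index name and document ID below are placeholders, so adjust them to your setup:

# watch a heavily updated document once a minute and check that _version climbs at the expected rate
curl -s 'localhost:9200/<your-index>/_doc/<heavily-updated-id>?_source=false'
# list the segments and their sizes to see whether lots of tiny new ones keep appearing
curl -s 'localhost:9200/_cat/segments/<your-index>?v&h=shard,segment,docs.count,docs.deleted,size'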

Okay, I think I didn’t explain it quite well; let me walk you through the entire indexing process we use.

  1. We have 1 billion documents, and 99% of the requests we get are update requests; not many new documents are inserted.
  2. We have an initial topology which writes all these update requests, at the document ID level, to a Redis system where the key is the document ID.
  3. While ingesting data into Redis, we set the polling time to 30 seconds after the first time a document for a particular key was inserted. This means that if we get 20 update requests for the same document ID, there will only be one key-value pair in Redis, with a polling time of 30 seconds after the first update request came in (a rough sketch of this is shown right after the list).
  4. We have an Apache Storm topology where we can configure the number of parallel workers, and these pick up the requests from the Redis database. The topology then batches the results and sends bulk update requests to Elasticsearch.
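
To make step 3 concrete, here is a rough sketch of the idea - not our actual code, the key names are made up, and the Redis TTL here just stands in for our "first insert + 30 seconds" poll timestamp (KEEPTTL needs Redis 6.0+):

# the first update for a doc ID creates the key and starts its 30-second window
redis-cli SET "pending:<doc-id>" "<update-payload>" NX EX 30
# later updates inside the window overwrite the payload but keep the original deadline
redis-cli SET "pending:<doc-id>" "<update-payload>" XX KEEPTTL
# when the window ends, the Storm workers pick the key up and fold it into a bulk update request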

This is the complete process; hope that clarifies things… let me know in case any additional details are required :smiley:.

I would run the checks I mentioned in order to try to verify it is indeed working as intended.

Does this mean that you are only indexing in bursts every 30 seconds or so?

True, and from the graphs I’ve visualised, I’m 99% sure that the CPU spike graph matches the number-of-segments graph. As soon as the number of segments increases, I always see a CPU increase.

Sure, I will run them during high indexing load and let you know.

No no, since it’s a continuous streaming topology, you could say that a burst happens every second (for the requests from the previous 30 seconds), somewhat like a sliding window.

/aside - a while ago I suggested to one of the admins that they create a kibana/elasticsearch cluster populated with this forum’s comments. Cos, you know, for search. This (longish) thread demonstrates the use case. Did someone other than me mention VMware in this thread? Not so easy to see without a lot of scrolling.

Picking a few small things:

  • you can find out the filesystem by just typing the mount command on a data node; look for your data partition, likely vdb or vdb1, and you will see ext4/xfs/... Paste the output here if unsure. lsblk -f would also tell you (see the short example after this list)
  • you have mentioned a few times "320GB/340gb ssd", but the disk quoted is 440+ GB. I hope this disk is exclusively allocated to elasticsearch?
  • If you are using your own _id, indexing new docs has a higher cost when the index size gets really large. Not likely critical, but you should know that.
  • I might have confused this with another thread: are your "VMs" from VMware or some other virtualization platform?
  • You seemed to very easily increase (v)CPU from 20 to 32, and it's a little mystery why it seemed to make no difference at all. But in the end a VM runs on a host, maybe with a bunch of other VMs on the same host. VMware, and other platforms, will let you over-provision resources massively. It would likely even allow you to put 2 or more data nodes on the same host, unless you specify otherwise. The point is that elasticsearch works best on dedicated resources, especially if you want the best performance. The assumption here is that you are in a corporate IT environment, where my experience is that the "VM team" don't know/care a lot about your use cases; they just satisfy your requests as asked. So the onus is on you to be asking the right questions.
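
For the first bullet, it is just the following on a data node (the output will obviously differ on yours):

# show what filesystem the elasticsearch data partition uses
mount | grep -i elasticsearch
lsblk -f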

Between second N and second N+1, the preceding 30-second windows overlap by 29 seconds. It might be me, but I have not understood your explanation here, sorry.

Can you share the index monitoring screens from kibana, the equivalent of mine below but for your much more interesting index, with, say, a typical 1-hour time interval?




Here are the snapshots during the high CPU spikes:

Wow, that’s crazy :open_mouth:.

Got this response, @leandrojmp:

/dev/vdb1 on /var/lib/elasticsearch type ext4 (rw,noatime,nodelalloc)

Got the response below:

NAME    FSTYPE FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINT
vda                                                                            
├─vda1  ext4   1.0         363680f3-0237-4987-a303-e236995e1017      6G    33% /
├─vda14                                                                        
└─vda15 vfat   FAT16       E76C-DD8F                               113M     9% /boot/efi
vdb                                                                            
└─vdb1  ext4   1.0         9263cd42-3be5-4efb-95ed-7b0f218f3589    299G    27% /var/lib/elasticsearch

We have a heterogeneous mix of data nodes: some of them are 340 GB and some are 440 GB. The node I’m running the commands on is a 440 GB one.

Will get back on this.

Understood. Can you let me know what all I should be asking the IT team?

From July? For a period of a few days, rather than the 1 hour I asked for (and I implicitly meant current data). Anyway, they show what you wrote to me: there were periods of high CPU for hours. You have changed a lot of things since July. You ran the lsblk command today, why not get the screenshots today? The IOPS are not high at all, not in the same ballpark as the Samsung product spec sheet you shared.

You have settled that you are using ext4 as the filesystem. Others might suggest tweaks there.

But for me the outstanding question is really the larger-scale setup - be it VMware or something else. How is IO provided to the VMs? In detail.

It's all written in this thread. You have an application (elasticsearch) that needs significant IO capability, specifically IOPS, on many instances simultaneously. Your elasticsearch instances need as close to exclusive access to resources (CPU and memory too, btw) as possible.

I had those screenshots handy, hence shared those… but they were the final screenshots after all the optimizations we had done (adding the dedup layer, reducing writes, adding coordinating nodes, etc.).

I can share some recent graphs in the morning as well, but the data is pretty much the same.

Sure, I’ll check this with the team which manages these virtual machine instances and then get back to you with the full details :smiley:

@leandrojmp any tweaks you might want to suggest here?

Some information I was able to get today states “We don't use SRIOV on the PM1733 drives. We use virtio-block to expose the disk to the VM after creating a logical volume on the 3.84T PM1733 drive. The read and write IOPs limits for p4i instances (this is the type of instance we use) is 100 per GiB.”
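
If I am reading that quota right, a ~440 GB volume would then be capped at roughly 440 × 100 ≈ 44,000 IOPS, though I may well be misinterpreting how the limit is applied.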

I don’t have much knowledge about how these operate internally or what questions to ask the team. If you could please list some questions pinpointing the exact details required, I’d be really grateful :grin:.

@RainTown / @Christian_Dahlqvist, I got these inputs when I asked my IT team about the hardware -

1. No, your 60 VMs are not sharing a single PM1733 disk. The 60 VMs are scheduled on different physical servers, and every physical server has 2 PM1733 disks which are shared among the VMs scheduled on that server.
2. We don't use SRIOV for the PM1733 disks. Rather, we just create LVMs and expose them to the VMs using virtio-blk.
3. Disk formatting is controlled by the VM owners. We just give raw block devices and the VM owners decide what filesystem to use.

Right, so my question: we have a read aggregation query which runs at a certain QPS and utilises 25% of the CPU on the data nodes (this was before we added coordinating nodes).

Now, after adding the coordinating nodes, they only experience a spike of 5%, whereas the load on the data nodes is the same… why are the coordinating nodes not helping with the aggregation here?

Is it necessary for the number of primary shards to be a multiple of the routing_partition_size?

It is helping with part of the aggregation work, but as far as I know most of the work is done at the shard level, which has to be executed on the data nodes.
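
One rough way to see that split is to compare CPU across node roles while the aggregation is running, e.g. something like this (adjust the host to your setup):

# CPU and load per node, with node roles, sampled while the aggregation query runs
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,cpu,load_1m'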

I do not know. Even if you increase the number of partitions I do not see it resolving the issue. At best it probably just gives you a bit of added time.

I have no experience setting this up but would suspect there is a level of configurability involved. I ran a quick search online and found some guides around optimising e.g. the number of IO threads. If this is all correctly configured, it sounds like you should be able to achieve near-native performance corresponding to your IOPS quotas, but there is no way for us to know whether that is the case or not.

Did you have a chance to verify that your deduplication logic is working as expected so we can remove that as a potential issue/contributing factor?

Not yet, I have noted that down as one of the action items to complete by today (hopefully :P) :smiley:


Approaching 100 posts on this thread ...

@Lakshya_Gupta please try to share the iostat -ztcxd 60 60 or iostat -ztcxd 10 360 output and kibana screenshots, ideally covering the same period that includes the 90%+ CPU load. What I want to see is whether you ever get a sustained period with IOPS higher than a couple of thousand on whichever data nodes are loaded at the time.
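
I.e. on one of the loaded data nodes, something like:

# 10-second samples for an hour; -z hides idle devices, -t adds timestamps
iostat -ztcxd 10 360
# for vdb, watch r/s + w/s (your IOPS), rkB/s / wkB/s, r_await / w_await and %util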

I probably looked at similar posts to @Christian_Dahlqvist about possible issues (gotchas, even) with different virtualization setups in terms of IO. But via this forum that sort of low-level stuff is going to be really difficult to pin down; there are just too many variables.

If I've interpreted it right, that's kind of good news: the disks are in the physical servers running the VMs. But a critical word there is "shared". I'd like to know how "shared" these resources are, what you are sharing with, what else you are sharing, and whether you even have multiple elasticsearch data node VMs on the same physical server. E.g. maybe your increase of CPU count from 20 to 32 made little difference because the CPUs were already overcommitted.

Also:

LVMs here might mean Linux Logical Volume Manager (lvm) devices, but it can have other interpretations. Either way, it's just another level of abstraction, which makes seeing things clearly harder. And, without a doubt, SR-IOV would work better if set up correctly.

As a senior Elastic contributor wrote recently, effectively the TLDR here is:

Elastic node with Dedicated Host or VMs (with dedicated resources) = Best / Stable Outcome

Sure, this metric and the one suggested by Christian (“Did you have a chance to verify that your deduplication logic is working as expected so we can remove that as a potential issue/contributing factor?”)… I’ll try to get both of these and post them here as soon as I’m able to see the load.

Sure, let me check this as well.

Sure, thanks for that input, checking on this as well with the infra team.

No, I would just mention checking the mount options: by default ext4 is mounted with the relatime parameter, and it should be changed to noatime, which adds a boost in speed. But you already shared that your mount options include noatime.
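
For reference, had it not been set already, the change is just the mount option: add noatime to the ext4 entry for /var/lib/elasticsearch in /etc/fstab and then remount:

# apply without a reboot; the fstab change makes it stick across reboots
sudo mount -o remount,noatime /var/lib/elasticsearch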
