High CPU Usage on a few data nodes / Hotspotting of data

Do you recommend switching to XFS instead?

1 Like

I don't think this will make any difference. Besides that, to make this change you would need to reinstall and reconfigure your nodes.

Being honest, I'm not sure you will be able to solve this issue, as it seems to be a problem with the design choices: you mentioned that you have been having this problem for the past 4 years and that it is getting worse.

Not sure if it was asked already, but what is the refresh_interval on your indices? The default is 1s; increasing this value can help indexing performance and reduce CPU usage, but it also delays how quickly new documents become visible to search, so you would need to test.
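
For reference, refresh_interval is a dynamic setting, so it can be changed (and reverted) on a live index without a restart; a minimal sketch, with the index name as a placeholder:

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}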

Also, I could not find it, but how many indices do you have and what is the total size of your cluster? You mentioned something around 60 data nodes with 320 GB of disk each, which seems a weird size: that is roughly 19 TB raw, and after the disk watermarks you would have something close to 16 TB of usable space, which seems a small amount to split across 60 nodes. I had an on-premises cluster with a hot tier of 16 TB split across only 4 nodes.
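
A quick way to check both, and to see whether data or disk usage is skewed toward a few nodes, is the cat APIs, for example:

GET _cat/indices?v&s=store.size:desc
GET _cat/allocation?v&s=disk.percent:desc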

I think you will need to review your cluster design to be able to fix this problem.

1 Like

30 seconds.

We currently have a 63 data node cluster (the remaining nodes are 3 masters and 5 coordinators). These 63 data nodes have 340/440 GB of storage attached to each of them. Can you suggest a better design? Also, we only use 3 TB out of the 16 TB of storage.

Wasn’t able to generate enough load today :confused: Will get back on this tomorrow.

1 Like

Dropping replicas from 2 down to 1 should help.

From my testing in our production cluster, more than 1 replica doesn't seem to speed up reads noticeably, but it does incur a heavy CPU cost during writes, especially when all of your writes are updates.
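
If you do want to test it, number_of_replicas is also a dynamic setting, so it is easy to drop and restore; a sketch, with the index name as a placeholder:

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}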

2 Likes

Originally (5 days ago) you mentioned around 60 data nodes. So 3 new data nodes were added?

Anyway, you keep posting these things as if they were real hardware. You have virtual machines. We (at least I) don't know yet:

  • How the storage is "sliced and diced" before being presented to the VMs, and what are you sharing it with?
  • How much storage there is - you quoted a colleague saying you had 2x 3.8TB PM1733 per server which "are shared among the VMs scheduled on that server". Well, that's almost 8TB per server.
  • We "only use 3TB out of the 16TB storage". Is that, say, 2 servers? Or are there 20 servers, or 50, or 250, or ...?
  • Is there any over-provisioning, of anything, going on here?
  • What exactly is the Virtual layer? Some VMware product? Something else?

I also note:

p4i isn't an instance type I recognize, so I'm guessing this is some kind of corporate cloud solution copying (a bit) from AWS-style terminology. Maths tells me that

100 IOPS / GiB

would mean

32,000 IOPS for a 320 GiB volume (and we have no evidence you are seeing anything remotely close to that)

on the assumption of a linear mapping, but it would also translate to

380,000 IOPS for a 3.8 TB disk

and I'm a little skeptical of that.

1 Like

It has been 63 since the start :sweat_smile:

1 Like

Will come back on these tomorrow. I have very little knowledge about this and will need to discuss internally.

Thanks for that suggestion, but we can't go with that :confused: To make our system fault tolerant, we need at least 2 replica shards.

That's what we thought too, but we eventually reduced all but a few indices down to 1 replica. Unless you are expecting multiple nodes to be down simultaneously, ES is pretty good with shard assignment.

Granted, you have custom routing. I assume you let the replicas be assigned automatically?

Another potential solution is to increase the shard count of your heavy indices. Spreading the CPU load across more nodes will reduce hotspotting.
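
Since number_of_shards cannot be changed on an existing index, that means creating a new index with a higher primary shard count and reindexing into it; a minimal sketch (the index name and shard count are placeholders):

PUT my-heavy-index-v2
{
  "settings": {
    "index": {
      "number_of_shards": 10
    }
  }
}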

You will need to reindex for your use case. I assume you guys don't create new indices periodically, since your workload is update-heavy.

What we have done is create a monthly index, say "user_202512". The index contains user info, is constantly being updated, etc.

When a user's data is updated, we write the data to 2 indices: "user_202512" & "user_202601".

We simply write the current user data to both this month's and next month's index.

The reader simply reads "user_currentMonth" to get the latest data.

That way, our "user" index rotates monthly, which allows changes to settings and/or mappings. Periodically creating a new index in ES is beneficial: it makes stale data (from a misbehaving application) very easy to spot.
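
A sketch of that dual-write pattern, using the user_YYYYMM naming above (the document body and id are made up):

PUT user_202512/_doc/42
{
  "user_id": 42,
  "plan": "premium"
}

PUT user_202601/_doc/42
{
  "user_id": 42,
  "plan": "premium"
}

GET user_202512/_doc/42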

1 Like

Correct.

We cannot create a new index because our use case involves updates to data that can be 3 years old as well.

Also, we tried re-indexing after increasing the routing_partition_size from 5 to 8, but after re-indexing, all the data for a single routing key went to 1 shard (instead of being spread across 8 shards).

We used the command below. Are you aware of why this could have happened?

POST _reindex?slices=50&requests_per_second=-1&wait_for_completion=false
{
  "conflicts": "proceed",
  "source": {
    "index": "listings"
  },
  "dest": {
    "index": "listings_v2",
    "routing": "keep"
  }
}
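
For reference (not necessarily the cause here): index.routing_partition_size is a static setting, so it only takes effect if it is defined in the destination index's settings when that index is created, and such indices also require the _routing field to be marked as required in the mappings. A sketch of creating the destination before running the reindex above (the shard count is an assumption):

PUT listings_v2
{
  "settings": {
    "index": {
      "number_of_shards": 40,
      "routing_partition_size": 8
    }
  },
  "mappings": {
    "_routing": {
      "required": true
    }
  }
}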

In the next post I'll make sure to consolidate and post all of the below:

  • open questions around the internals of our VMs, etc.
  • iostat metrics along with ES graphs during periods of high CPU spikes
  • cluster configuration

Since collecting this data can take 1-2 days, I'll post all the details in a single shot; maybe then we will be able to find a solution :smiley:

1 Like

Hi @RainTown @Christian_Dahlqvist, here are some of the new points:

  1. storage is sliced using LVM and presented to the VMs using virtio-blk

  2. virtualization layer - QEMU/KVM

  3. the infra team has configured 100 IOPS per GB at a block size of 4 KB (the 100 IOPS per GB is committed only for a 4 KB block size)

  4. I checked internally within my team and we don't seem to have a noisy neighbour problem

  5. cluster configuration

    1. 3 master nodes - each having 4 cores and 24 GiB memory
    2. 5 coordinator nodes - each having 4 cores and 53 GiB memory
    3. 62 data nodes - 44 nodes have 320 GB storage / 20 cores / 53 GiB memory each; the remaining 18 nodes have 446 GB storage / 20 cores / 53 GiB memory each
  6. this is how our disk is configured:

     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
       <source dev='/home/pcissd/vm-053af617-0264-4cd2-52e8-3b32935676d0-PCI-SSD-P3-disk1'/>
       <target dev='vdb' bus='virtio'/>
       <iotune>
         <read_iops_sec>32000</read_iops_sec>
         <write_iops_sec>32000</write_iops_sec>
       </iotune>
       <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
     </disk>

I’m generating load meanwhile to run the iostat commands, will share the results soon.

So this is not an Elasticsearch question now; it's just IO tuning in a KVM environment.

Take that on one of your hot nodes, for a test, and see what you get.

But I am guessing you need to enable multi-queue: add queues='4' (or some bigger number) to the driver line in the config, after the discard=... attribute. I believe that, if unset, the default is 1.
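
For example, the driver line from the config above would become something like this (4 queues is just an illustration, not a recommendation):

       <driver name='qemu' type='raw' cache='none' io='native' discard='unmap' queues='4'/>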

On your VM, see if there are dirs under:

ls /sys/block/vdb/mq/

If you have just a directory called 0 then you have 1 queue.

e.g. for me

[root@rhel10x1 ~]# ls /sys/block/vda/mq/
0  1  2  3

The queue count should not exceed processor count.

1 Like

ls /sys/block/vdb/mq/

Got the below output:

0  1  10  11  12  13  14  15  16  17  18  19  2  3  4  5  6  7  8  9

Number of cores = 20, should be good then?

We have one hypothesis right now; just thinking out loud.

Let's debug the highlighted node. From the above graph, we can see that the disk throughput is reaching a max of 385 MB/s = 394,240 KB/s.

From this graph, we can see that the disk IOPS is 3.21K.

Hence, block size = total disk throughput / disk IOPS = 394,240 / 3,210 ≈ 122.8 KB.

Our infra team has committed write IOPS of 100 per GB only for a block size of 4 KB. However, our block size seems to be reaching ~122 KB. Does that mean we are not actually getting 100 IOPS per GB and are hitting a wall?
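
One way to sanity-check that block size directly, rather than deriving it from the graphs, is iostat's extended device stats on the hot node while the load is running; a sketch:

# Extended per-device stats every 5 seconds; look at the vdb row.
# Depending on the sysstat version, the average request size is reported as
# rareq-sz / wareq-sz (in KB) or avgrq-sz (in 512-byte sectors).
iostat -xd 5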

Did we get anywhere on this?

Not yet :confused:

Yes, it's a decent theory.

And I think I took us down a wrong alley. The set of 20 queues won't help much here, as we have only one (hot) shard per node, right? So the available parallelism doesn't help much, as each shard has one writer thread.

Not necessarily. In total we have 5 hot primary shards, meaning 15 hot shards in total (5 primaries and 10 of their replicas). Considering our cluster configuration, every node usually has 3 shards, so in the worst case all 15 hot shards (primary + replica) can end up concentrated on just 5 data nodes.

So, in the worst case, a data node can have 3 hot shards.
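
To see how those hot shards are actually distributed across nodes, the cat shards API can be sorted by node; a sketch using the listings index name from earlier in the thread (the column list is just a convenient subset):

GET _cat/shards/listings?v&h=index,shard,prirep,store,node&s=node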