High CPU Usage on a few data nodes / Hotspotting of data

True.

I was thinking more of showing that the bottleneck at peak load is the IO, and likely either the single thread or the throttled IOPS, rather than CPU. The other cores might help in other parts of the workload, so I would not go as far as "wasted". There's also a decent chance the hosts' real CPUs are overcommitted by your "VM team".


Hi @RainTown @Christian_Dahlqvist I’m back with some new results :smiley: .

So I created 2 VMs: one with 20 cores, 53G memory and a 640GB disk, and another with 20 cores, 53G memory and a 320GB disk.

As per the 100 IOPS/GB storage promise, I ran two fio tests on each: one with a block size of 128K and one with a block size of 4K. Here are the results:

FIO with BlockSize of 128K

Command used

fio --name=throughput_128k_test \
--filename=/var/lib/elasticsearch/fio_test_file \
--size=10G \
--time_based \
--runtime=60s \
--ramp_time=2s \
--ioengine=libaio \
--direct=1 \
--verify=0 \
--bs=128K \
--iodepth=64 \
--rw=write \
--group_reporting

Results for the 320GB VM

throughput_128k_test: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
fio-3.25
Starting 1 process
throughput_128k_test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=3635MiB/s][w=29.1k IOPS][eta 00m:00s]
throughput_128k_test: (groupid=0, jobs=1): err= 0: pid=1018086: Fri Dec 19 13:49:25 2025
  write: IOPS=21.7k, BW=2708MiB/s (2840MB/s)(159GiB/60002msec); 0 zone resets
    slat (usec): min=2, max=2006, avg= 7.90, stdev= 5.53
    clat (usec): min=348, max=68787, avg=2945.54, stdev=2823.19
     lat (usec): min=361, max=68795, avg=2953.53, stdev=2823.33
    clat percentiles (usec):
     |  1.00th=[  996],  5.00th=[ 1385], 10.00th=[ 1565], 20.00th=[ 1762],
     | 30.00th=[ 1909], 40.00th=[ 2073], 50.00th=[ 2245], 60.00th=[ 2474],
     | 70.00th=[ 2802], 80.00th=[ 3359], 90.00th=[ 4817], 95.00th=[ 6652],
     | 99.00th=[12256], 99.50th=[18744], 99.90th=[40109], 99.95th=[45351],
     | 99.99th=[57934]
   bw (  MiB/s): min= 1302, max= 4098, per=100.00%, avg=2711.08, stdev=710.78, samples=120
   iops        : min=10418, max=32787, avg=21688.45, stdev=5686.28, samples=120
  lat (usec)   : 500=0.03%, 750=0.13%, 1000=0.88%
  lat (msec)   : 2=34.91%, 4=49.72%, 10=12.60%, 20=1.26%, 50=0.44%
  lat (msec)   : 100=0.03%
  cpu          : usr=10.38%, sys=8.41%, ctx=104224, majf=0, minf=58
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1300015,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=2708MiB/s (2840MB/s), 2708MiB/s-2708MiB/s (2840MB/s-2840MB/s), io=159GiB (170GB), run=60002-60002msec

Disk stats (read/write):
  vdb: ios=1/1330746, merge=0/177, ticks=3/3736582, in_queue=3736602, util=99.94%

Results for the 640GB VM

throughput_128k_test: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=64
fio-3.25
Starting 1 process
throughput_128k_test: Laying out IO file (1 file / 10240MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=3437MiB/s][w=27.5k IOPS][eta 00m:00s]
throughput_128k_test: (groupid=0, jobs=1): err= 0: pid=1270571: Fri Dec 19 13:49:23 2025
  write: IOPS=27.3k, BW=3410MiB/s (3576MB/s)(200GiB/60001msec); 0 zone resets
    slat (usec): min=2, max=1176, avg= 6.17, stdev= 4.38
    clat (usec): min=324, max=113203, avg=2339.16, stdev=4130.76
     lat (usec): min=327, max=113208, avg=2345.42, stdev=4130.78
    clat percentiles (usec):
     |  1.00th=[  1029],  5.00th=[  1352], 10.00th=[  1483], 20.00th=[  1614],
     | 30.00th=[  1729], 40.00th=[  1844], 50.00th=[  1942], 60.00th=[  2073],
     | 70.00th=[  2212], 80.00th=[  2442], 90.00th=[  2737], 95.00th=[  3326],
     | 99.00th=[  7111], 99.50th=[  9110], 99.90th=[ 72877], 99.95th=[ 73925],
     | 99.99th=[107480]
   bw (  MiB/s): min= 2398, max= 4134, per=100.00%, avg=3413.08, stdev=455.32, samples=120
   iops        : min=19188, max=33074, avg=27304.42, stdev=3642.56, samples=120
  lat (usec)   : 500=0.02%, 750=0.05%, 1000=0.73%
  lat (msec)   : 2=53.65%, 4=42.45%, 10=2.72%, 20=0.06%, 100=0.31%
  lat (msec)   : 250=0.02%
  cpu          : usr=8.95%, sys=10.24%, ctx=124473, majf=0, minf=59
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1636950,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=3410MiB/s (3576MB/s), 3410MiB/s-3410MiB/s (3576MB/s-3576MB/s), io=200GiB (215GB), run=60001-60001msec

Disk stats (read/write):
  vdb: ios=1/1701233, merge=0/176, ticks=1/3769345, in_queue=3769357, util=99.91%

FIO with BlockSize of 4K

Command used

fio --name=iops_4k_test \
--filename=/var/lib/elasticsearch/fio_test_file \
--size=5G \
--time_based \
--runtime=30s \
--ramp_time=2s \
--ioengine=libaio \
--direct=1 \
--verify=0 \
--bs=4k \
--iodepth=64 \
--rw=randwrite \
--group_reporting

Results for the 320GB VM

iops_4k_test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=125MiB/s][w=32.0k IOPS][eta 00m:00s]
iops_4k_test: (groupid=0, jobs=1): err= 0: pid=1066824: Fri Dec 19 13:56:44 2025
  write: IOPS=31.0k, BW=125MiB/s (131MB/s)(3750MiB/30003msec); 0 zone resets
    slat (nsec): min=1366, max=226119, avg=5024.06, stdev=1325.32
    clat (usec): min=136, max=27872, avg=1994.66, stdev=223.36
     lat (usec): min=138, max=27874, avg=1999.77, stdev=223.40
    clat percentiles (usec):
     |  1.00th=[ 1811],  5.00th=[ 1975], 10.00th=[ 1975], 20.00th=[ 1991],
     | 30.00th=[ 1991], 40.00th=[ 1991], 50.00th=[ 1991], 60.00th=[ 1991],
     | 70.00th=[ 2008], 80.00th=[ 2008], 90.00th=[ 2008], 95.00th=[ 2008],
     | 99.00th=[ 2089], 99.50th=[ 2409], 99.90th=[ 3130], 99.95th=[ 3589],
     | 99.99th=[ 9110]
   bw (  KiB/s): min=128040, max=128208, per=100.00%, avg=128119.87, stdev=45.39, samples=60
   iops        : min=32010, max=32052, avg=32030.00, stdev=11.35, samples=60
  lat (usec)   : 250=0.04%, 500=0.07%, 750=0.07%, 1000=0.07%
  lat (msec)   : 2=70.38%, 4=29.34%, 10=0.03%, 20=0.01%, 50=0.01%
  cpu          : usr=3.64%, sys=17.81%, ctx=942849, majf=0, minf=58
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,959978,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=3750MiB (3932MB), run=30003-30003msec

Disk stats (read/write):
  vdb: ios=0/1027168, merge=0/6, ticks=0/2038914, in_queue=2038918, util=99.79%

Results for the 640GB VM

iops_4k_test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=250MiB/s][w=64.0k IOPS][eta 00m:00s]
iops_4k_test: (groupid=0, jobs=1): err= 0: pid=1320726: Fri Dec 19 13:56:42 2025
  write: IOPS=64.0k, BW=250MiB/s (262MB/s)(7501MiB/30002msec); 0 zone resets
    slat (nsec): min=1384, max=72871, avg=3644.48, stdev=1734.65
    clat (usec): min=125, max=43990, avg=995.84, stdev=356.38
     lat (usec): min=127, max=43991, avg=999.57, stdev=356.41
    clat percentiles (usec):
     |  1.00th=[  922],  5.00th=[  971], 10.00th=[  979], 20.00th=[  988],
     | 30.00th=[  988], 40.00th=[  996], 50.00th=[  996], 60.00th=[  996],
     | 70.00th=[ 1004], 80.00th=[ 1004], 90.00th=[ 1012], 95.00th=[ 1020],
     | 99.00th=[ 1029], 99.50th=[ 1057], 99.90th=[ 1516], 99.95th=[ 1778],
     | 99.99th=[24773]
   bw (  KiB/s): min=255968, max=258012, per=100.00%, avg=256259.27, stdev=238.06, samples=60
   iops        : min=63992, max=64503, avg=64064.85, stdev=59.51, samples=60
  lat (usec)   : 250=0.31%, 500=0.23%, 750=0.25%, 1000=62.13%
  lat (msec)   : 2=37.06%, 4=0.01%, 50=0.01%
  cpu          : usr=4.74%, sys=25.43%, ctx=1454919, majf=0, minf=58
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,1920217,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=250MiB/s (262MB/s), 250MiB/s-250MiB/s (262MB/s-262MB/s), io=7501MiB (7865MB), run=30002-30002msec

Disk stats (read/write):
  vdb: ios=0/2054614, merge=0/6, ticks=0/2029279, in_queue=2029284, util=99.81%

Please run fio with a more realistic, Elasticsearch-like workload. Elasticsearch does more syncs and flushes with small blocks, and we (at least I) believe you are single-thread limited.

fio --name=es_like \
--directory=/var/lib/elasticsearch/fio_es \
--numjobs=1 \
--filesize=5G \
--nrfiles=2 \
--rw=write \
--bs=8k \
--iodepth=1 \
--ioengine=libaio \
--direct=1 \
--fdatasync=1 \
--time_based \
--runtime=60 \
--group_reporting

Some explanation:

numjobs=1 as you likely have a single shard
iodepth=1 as Elasticsearch does (I believe) serialized fsyncs
fdatasync=1 → translog
and a smallish block size + direct I/O. Far from a perfect match, but a bit closer to reality.

My reading of the results you shared is that the IOPS/GB limit is working, and it's unlikely to be the main problem. So increasing/eliminating it won't help. And I fear the main problem is, going full circle, just the design + skew.

EDIT: if you run the above also with bs=4K and bs=16K and the IOPS stay roughly constant, that would further support that the IOPS/GB limit is not the issue you need to worry about.
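If it helps, a throwaway loop like this would run the same job back to back at all three block sizes (untested sketch, same assumptions as above: fio installed and /var/lib/elasticsearch/fio_es writable; each run lays out its own 2x5G test files):

# run the es_like job at 4k, 8k and 16k, printing a header before each run
for bs in 4k 8k 16k; do
  echo "=== es_like bs=${bs} ==="
  fio --name=es_like_${bs} \
  --directory=/var/lib/elasticsearch/fio_es \
  --numjobs=1 \
  --filesize=5G \
  --nrfiles=2 \
  --rw=write \
  --bs=${bs} \
  --iodepth=1 \
  --ioengine=libaio \
  --direct=1 \
  --fdatasync=1 \
  --time_based \
  --runtime=60 \
  --group_reporting
done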


Does it matter if you use --rw=write or --rw=randwrite for this test?

I agree that the other parameter changes you suggested are likely to result in a more Elasticsearch-like workload. :+1:

Asking me? No, it does not matter, as for this test I'm really only trying to establish how fast a single synchronous writer thread can commit data. Which, I'm speculating, is unlikely to be constrained by the IOPS limit, and therefore independent of the 320GB / 640GB disk size allocated to the 2 VMs.


Ran the above with a 4K block size and got the results below:

320GB Machine

es_like: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
es_like: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=1012KiB/s][w=253 IOPS][eta 00m:00s]
es_like: (groupid=0, jobs=1): err= 0: pid=2276816: Fri Dec 19 17:07:46 2025
  write: IOPS=381, BW=1527KiB/s (1564kB/s)(89.5MiB/60002msec); 0 zone resets
    slat (usec): min=7, max=499, avg=28.04, stdev=12.29
    clat (nsec): min=1430, max=12459k, avg=374270.17, stdev=272386.04
     lat (usec): min=49, max=12495, avg=402.63, stdev=279.96
    clat percentiles (usec):
     |  1.00th=[   45],  5.00th=[   51], 10.00th=[   55], 20.00th=[   61],
     | 30.00th=[   82], 40.00th=[  388], 50.00th=[  445], 60.00th=[  478],
     | 70.00th=[  519], 80.00th=[  553], 90.00th=[  619], 95.00th=[  725],
     | 99.00th=[ 1057], 99.50th=[ 1139], 99.90th=[ 1336], 99.95th=[ 1598],
     | 99.99th=[ 5800]
   bw (  KiB/s): min=  752, max= 6096, per=100.00%, avg=1533.36, stdev=1222.15, samples=119
   iops        : min=  188, max= 1524, avg=383.33, stdev=305.54, samples=119
  lat (usec)   : 2=0.01%, 50=4.55%, 100=27.60%, 250=1.63%, 500=31.71%
  lat (usec)   : 750=29.95%, 1000=2.95%
  lat (msec)   : 2=1.58%, 4=0.02%, 10=0.01%, 20=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=42, max=14793, avg=427.20, stdev=324.18
    sync percentiles (nsec):
     |  1.00th=[   69],  5.00th=[   84], 10.00th=[   97], 20.00th=[  135],
     | 30.00th=[  298], 40.00th=[  330], 50.00th=[  370], 60.00th=[  430],
     | 70.00th=[  502], 80.00th=[  620], 90.00th=[  804], 95.00th=[ 1012],
     | 99.00th=[ 1336], 99.50th=[ 1512], 99.90th=[ 1752], 99.95th=[ 1816],
     | 99.99th=[ 9792]
  cpu          : usr=1.62%, sys=6.77%, ctx=45831, majf=0, minf=15
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22913,0,0 short=22913,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1527KiB/s (1564kB/s), 1527KiB/s-1527KiB/s (1564kB/s-1564kB/s), io=89.5MiB (93.9MB), run=60002-60002msec

Disk stats (read/write):
  vdb: ios=0/68669, merge=0/45860, ticks=0/50695, in_queue=77175, util=99.96%

640GB Machine

es_like: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
es_like: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=1761KiB/s][w=440 IOPS][eta 00m:00s]
es_like: (groupid=0, jobs=1): err= 0: pid=2560515: Fri Dec 19 17:07:48 2025
  write: IOPS=582, BW=2331KiB/s (2387kB/s)(137MiB/60001msec); 0 zone resets
    slat (nsec): min=7425, max=79012, avg=21437.94, stdev=7415.54
    clat (nsec): min=1111, max=1128.9k, avg=258184.50, stdev=163192.00
     lat (usec): min=45, max=1153, avg=279.83, stdev=168.74
    clat percentiles (usec):
     |  1.00th=[   39],  5.00th=[   42], 10.00th=[   44], 20.00th=[   52],
     | 30.00th=[   67], 40.00th=[  262], 50.00th=[  318], 60.00th=[  351],
     | 70.00th=[  375], 80.00th=[  400], 90.00th=[  441], 95.00th=[  478],
     | 99.00th=[  553], 99.50th=[  594], 99.90th=[  676], 99.95th=[  734],
     | 99.99th=[  881]
   bw (  KiB/s): min= 1080, max= 8320, per=100.00%, avg=2337.55, stdev=1543.67, samples=119
   iops        : min=  270, max= 2080, avg=584.39, stdev=385.92, samples=119
  lat (usec)   : 2=0.01%, 20=0.01%, 50=17.39%, 100=17.51%, 250=3.93%
  lat (usec)   : 500=57.96%, 750=3.15%, 1000=0.05%
  lat (msec)   : 2=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=50, max=18252, avg=265.58, stdev=192.05
    sync percentiles (nsec):
     |  1.00th=[   54],  5.00th=[   56], 10.00th=[   74], 20.00th=[  105],
     | 30.00th=[  290], 40.00th=[  298], 50.00th=[  302], 60.00th=[  314],
     | 70.00th=[  322], 80.00th=[  326], 90.00th=[  338], 95.00th=[  382],
     | 99.00th=[  644], 99.50th=[ 1020], 99.90th=[ 1096], 99.95th=[ 1208],
     | 99.99th=[ 8032]
  cpu          : usr=1.55%, sys=7.71%, ctx=69937, majf=0, minf=16
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,34965,0,0 short=34965,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2331KiB/s (2387kB/s), 2331KiB/s-2331KiB/s (2387kB/s-2387kB/s), io=137MiB (143MB), run=60001-60001msec

Disk stats (read/write):
  vdb: ios=1/104760, merge=0/69920, ticks=1/50010, in_queue=75946, util=99.94%

With a 16K block size:

320GB Machine

es_like: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=4372KiB/s][w=273 IOPS][eta 00m:00s]
es_like: (groupid=0, jobs=1): err= 0: pid=2294013: Fri Dec 19 17:10:24 2025
  write: IOPS=461, BW=7382KiB/s (7559kB/s)(433MiB/60003msec); 0 zone resets
    slat (nsec): min=4180, max=85130, avg=23593.46, stdev=12267.17
    clat (usec): min=31, max=4305, avg=345.56, stdev=264.86
     lat (usec): min=44, max=4329, avg=369.41, stdev=272.33
    clat percentiles (usec):
     |  1.00th=[   42],  5.00th=[   46], 10.00th=[   48], 20.00th=[   59],
     | 30.00th=[   70], 40.00th=[  141], 50.00th=[  416], 60.00th=[  461],
     | 70.00th=[  506], 80.00th=[  553], 90.00th=[  627], 95.00th=[  750],
     | 99.00th=[ 1057], 99.50th=[ 1139], 99.90th=[ 1303], 99.95th=[ 1483],
     | 99.99th=[ 2507]
   bw (  KiB/s): min= 3264, max=102994, per=100.00%, avg=7404.07, stdev=10062.95, samples=119
   iops        : min=  204, max= 6437, avg=462.72, stdev=628.93, samples=119
  lat (usec)   : 50=12.70%, 100=25.04%, 250=2.88%, 500=27.86%, 750=26.51%
  lat (usec)   : 1000=3.53%
  lat (msec)   : 2=1.47%, 4=0.01%, 10=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=42, max=44006, avg=379.78, stdev=453.66
    sync percentiles (nsec):
     |  1.00th=[   44],  5.00th=[   46], 10.00th=[   52], 20.00th=[   97],
     | 30.00th=[  159], 40.00th=[  330], 50.00th=[  346], 60.00th=[  370],
     | 70.00th=[  410], 80.00th=[  490], 90.00th=[  828], 95.00th=[ 1048],
     | 99.00th=[ 1320], 99.50th=[ 1432], 99.90th=[ 1768], 99.95th=[ 1976],
     | 99.99th=[13376]
  cpu          : usr=1.70%, sys=6.70%, ctx=61104, majf=0, minf=15
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,27685,0,0 short=27685,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=7382KiB/s (7559kB/s), 7382KiB/s-7382KiB/s (7559kB/s-7559kB/s), io=433MiB (454MB), run=60003-60003msec

Disk stats (read/write):
  vdb: ios=0/77233, merge=0/43853, ticks=0/50779, in_queue=77318, util=99.96%

640GB Machine

es_like: (g=0): rw=write, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=1
fio-3.25
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][w=6656KiB/s][w=416 IOPS][eta 00m:00s]
es_like: (groupid=0, jobs=1): err= 0: pid=2576303: Fri Dec 19 17:10:28 2025
  write: IOPS=686, BW=10.7MiB/s (11.2MB/s)(643MiB/60001msec); 0 zone resets
    slat (nsec): min=4318, max=71284, avg=19136.03, stdev=9091.16
    clat (usec): min=27, max=1053, avg=229.78, stdev=164.34
     lat (usec): min=43, max=1071, avg=249.10, stdev=171.59
    clat percentiles (usec):
     |  1.00th=[   40],  5.00th=[   41], 10.00th=[   43], 20.00th=[   49],
     | 30.00th=[   58], 40.00th=[   78], 50.00th=[  273], 60.00th=[  322],
     | 70.00th=[  355], 80.00th=[  388], 90.00th=[  429], 95.00th=[  469],
     | 99.00th=[  553], 99.50th=[  594], 99.90th=[  693], 99.95th=[  766],
     | 99.99th=[  914]
   bw (  KiB/s): min= 4192, max=136160, per=100.00%, avg=11024.86, stdev=15767.97, samples=119
   iops        : min=  262, max= 8510, avg=689.05, stdev=985.50, samples=119
  lat (usec)   : 50=21.49%, 100=20.76%, 250=5.03%, 500=49.98%, 750=2.68%
  lat (usec)   : 1000=0.05%
  lat (msec)   : 2=0.01%
  fsync/fdatasync/sync_file_range:
    sync (nsec): min=50, max=11958, avg=242.23, stdev=169.53
    sync percentiles (nsec):
     |  1.00th=[   53],  5.00th=[   55], 10.00th=[   57], 20.00th=[   74],
     | 30.00th=[  116], 40.00th=[  302], 50.00th=[  310], 60.00th=[  318],
     | 70.00th=[  326], 80.00th=[  334], 90.00th=[  342], 95.00th=[  350],
     | 99.00th=[  438], 99.50th=[  892], 99.90th=[ 1128], 99.95th=[ 1208],
     | 99.99th=[ 7520]
  cpu          : usr=1.81%, sys=7.51%, ctx=90921, majf=0, minf=13
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,41179,0,0 short=41179,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=10.7MiB/s (11.2MB/s), 10.7MiB/s-10.7MiB/s (11.2MB/s-11.2MB/s), io=643MiB (675MB), run=60001-60001msec

Disk stats (read/write):
  vdb: ios=0/114671, merge=0/64793, ticks=0/50065, in_queue=76472, util=99.93%

Thanks. You see how the IOPS are in the hundreds? And that the 640GB is a bit better than the 320GB in both cases, though not close to 2x. And that util=99%+ in all 4 results.

So, we could try and spend time getting a fio profile that matches the real-world Elasticsearch profile closer and closer. I don't think that is now worth the effort. You are neither bandwidth nor IOPS bound - the storage is not the core problem, and faster/better storage would not "solve" it; at best it would just squeeze a little bit more out of the system.

You already established that more CPU made no real difference. And I speculated that less CPU would make no real difference, to this aspect at least.

So ... convinced yet it is probably the design + skew ?

One thing: this thread has been open a while, yet you didn't share any recent iostat results from the loaded nodes during a period of "maximum" load. We established your dedup layer helped a lot, and the "High CPU Usage" periods are now a lot shorter. What is now the "issue" you are trying to solve? Is a couple of minutes of less-great performance now and again not "acceptable"? Who, specifically, is complaining now, and about what?
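If you do catch such a window, a plain iostat capture on the hot node is plenty. A sketch only (the vdb device name is taken from your fio output; adjust it to whatever the Elasticsearch data disk is on the real nodes):

# extended per-device stats, one report every 5 seconds, with timestamps
iostat -dxmt vdb 5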


Right…

Why do you say so? Sorry, I might be wrong, but don't you think getting a disk with faster IOPS would help solve the issue, since currently we are only getting IOPS in the few hundreds? (Isn't IOPS the bottleneck, based on the latest run?)

Umm, agreed that we could have a better solution, but I was still expecting Elasticsearch to be more scalable and able to handle this.

I have not been able to generate load for the skewed data points, and when the load came organically, the command was not running, so I couldn't get the data points.

And true, the high CPU usage periods are now a lot shorter, but my hunch is that if we scale tomorrow, this issue might come back at a larger scale, hence we are trying to optimize it now.

Yes :confused: , we cannot breach our SLAs during festive periods, when we usually get such high load. Moreover, I've been working on this problem statement for a long time and I personally really want to solve it :smiley: .

Do you recommend switching to the below config, based on the above NFR?

"translog.durability": "async",

"translog.sync_interval": "30s"

Currently, the index has the below config :

 "translog" : {
          "generation_threshold_size" : "64mb",
          "flush_threshold_size" : "512mb",
          "sync_interval" : "5s",
          "retention" : {
            "size" : "-1",
            "age" : "-1"
          },
          "durability" : "REQUEST"
        }
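For reference, if we did switch, I assume it would just be a dynamic settings update along these lines (a sketch; "my-index" and localhost:9200 are placeholders for our actual index and endpoint):

curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "30s"
}
'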

No need to add fake load. Just observe the actual system. If your real load isn't generating a problem period worth capturing/sharing, be thankful!!

The disk, and the layers in between, can clearly sustain much higher IOPS than you see with the latest fio tests, and you see in real life usage.

Elasticsearch simply cannot reach the configured limit on real load, as it tries to keep data durable (which is a good thing btw) via a single (or small number of) hot IO threads. This is "by design" in Elasticsearch, and all other data stores that use disk will be somewhat similar. Your pattern would maybe better match an in-memory store, though that's a wider discussion. I am also not going to comment on the translog settings, as I simply don't know. Data durability is good.

I don't work for Elastic, so this is just my view, but you, er, self-limited the scalability through your design choices. That is only now demonstrated by the reality of the skew in the data, which you likely were unable to predict, so don't be too hard on yourselves.


If your deduplication logic is working, it may help to increase the refresh interval and see if it makes any difference. It is an easy change that can be done without downtime, so worth trying. I would also recommend running a test to verify that the deduplication logic is working as expected. If you currently have the load available, maybe you could add updates to a dummy document to the queue, e.g. 10 updates per second over a 5-minute period, and then check the version number after that has all been processed, before deleting the dummy document (this should result in just 10-12 updates). It does not really matter which shard the document goes to, as we are just looking to see the number of resulting updates.
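For the check at the Elasticsearch end, reading the document's version once the queue has drained is enough. Something like the below, where "my-index" and "dedup-test-doc" are placeholders (the roughly 3000 test updates themselves should of course go through your normal queue/dedup path, not directly to Elasticsearch):

# fetch only metadata; _version shows how many updates actually reached the shard
# (add &routing=... if your index requires explicit routing on document APIs)
curl -s "localhost:9200/my-index/_doc/dedup-test-doc?_source=false"
# with a 30-second dedup window, ~3000 queued updates over 5 minutes should
# collapse to a _version of roughly 10-12

# clean up afterwards
curl -s -X DELETE "localhost:9200/my-index/_doc/dedup-test-doc"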

If you have a test system to use for this, make sure you are using the same level of parallelism and consumers as in production. If there is any issue here it is in my experience not unlikely to be related to some type of concurrency issue that may be hard to find during smaller scale testing.

Changing durability to async is something I would only do as a last resort as it reduces resiliency and can cause problems. I avoid doing this so do not have any experience with the kind of problems it may or may not cause.

As @RainTown pointed out, through your design choices you have limited how efficiently Elasticsearch can work, so I think a change in the design may be required if nothing else works.

… and see if you can somehow address the source of the duplication.

Is it really duplicate documents, exact same content ? Or just very rapidly changing documents ?

And the skewed keys, is it a very small number of keys that causes the skew? How small?

We're at a point where the lack of detailed application / use case knowledge is a barrier for me to make many more suggestions.

EDIT: slightly wild speculation on my part, but my hunch is a more balanced solution would be better in your use case. For example a smaller cluster, bigger disks, significantly more shards for the critical index, and forget the custom routing. There is a decent chance that the calculus is now very different, the very trade-off I was thinking about in my first response on this thread. It's at least worth re-evaluating.

EDIT2: On the dedup, I still didn't quite understand, sorry. But you did write:

So in a nutshell, every 30 seconds, we can only have 1 update per listingId (_Id in Elastic)

If that's what is actually happening, i.e. updates to the same _id happen a minimum of 30s apart, I am pretty surprised you are having issues. That said, no one has yet answered @Christian_Dahlqvist's question above.

EDIT3: You ruled out that there was any ingesting of docs with refresh=true, right? Or any process that might be forcing refreshes more often than the interval you set for the index (30s)?
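One way to cross-check from the Elasticsearch side is the refresh section of the index stats: sample it twice, a few minutes apart, and see whether the counters climb much faster than one refresh per shard per 30s would explain. A sketch, index name is a placeholder:

# refresh counters for the index; as I understand it, external_total covers
# interval-driven refreshes plus explicit refresh API calls
curl -s "localhost:9200/my-index/_stats/refresh?filter_path=_all.primaries.refresh"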

btw:

[Screenshot 2025-12-19 at 20.22.33]

I think the audience for this thread will now be pretty low.


I looked through the thread but could not find what bulk request size you use. If this is small, or the updates in it are spread across many shards, resulting in a small number of documents per shard per bulk request, this can result in increased overhead in terms of syncing to disk. As we have seen from the benchmarks, this seems to have an impact on the number of IOPS achieved.

Can you find out some statistics about this?

I do not think it should affect merging frequency, but it could potentially increase I/O load and thereby indirectly affect merging.

The current refresh interval is 30 seconds; do you suggest going beyond that?

Yeah, I'm trying to do that but somehow have not been able to generate duplicate events; I'll see if there is some other way.

Oh :confused: .

Duplicate documents with the exact same content.

Yeah, 5 keys out of a few hundred thousand.

Let me know if you have any more questions around the working of the application :smiley: .

I won't say there are 0 instances of this, but we do call force refresh approximately 9 times in 6 hours. Is that significant?

Sounds interesting, how do I validate this?

No, if it is already set to 30 seconds I do not see any reason to further increase it.

The issue with frequent updates will be caused by all types of updates, not just duplicate events. If you have frequent updates to the same document that are different and not caught by your deduplication layer, you would benefit from merging these into one in your application logic. This is what I thought you were doing, as I believe you said you buffered by document ID and that only a single update of any kind would be sent for a specific document every 30 seconds.

No, that should not cause a problem at that frequency.

You need to check the code responsible for ingesting data into Elasticsearch. I do not think received bulk sizes are tracked in monitoring.

Yep, this is what we are doing.

At this point I do not think you will see any major improvement from tweaking the infrastructure. It may get you some, but it will likely just postpone the issues a bit further.

If there are any issues in your ingest pipeline, resolving those may however have a material impact. I would therefore again recommend that you verify that it works as intended with respect to merging and deduplicating events and check that you are using reasonably large bulk requests.

If there is nothing to be fixed there I believe you may need to rearchitect the solution in order to significantly improve performance. This could be a minor change, e.g. breaking out the hot IDs into a separate index that does not use routing in order to better spread out this load, or a more fundamental one where you review the use case and design as a whole.
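As a very rough sketch of the minor variant (the index name, shard count and query form below are illustrative assumptions, not specific recommendations):

# dedicated index for the handful of hot keys, using default (hash-by-_id) routing,
# so documents that would have shared one hot routing key spread across all shards
curl -X PUT "localhost:9200/my-index-hot" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}
'
# the ingest layer then sends documents whose key is in the hot list to my-index-hot
# without a routing parameter, and queries cover both indices, e.g.:
# GET my-index,my-index-hot/_search
# The trade-off is that queries for the hot keys now fan out across shards.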

It is worth noting that it is your query patterns, together with the high update rate, that have forced you down this path of using routing, which has resulted in hotspots. If you decide to rearchitect, I would recommend you start by looking at your query patterns and try to revise those.

Elasticsearch is a search engine and is optimised for retrieval of reasonably small result sets based on relevance. It is not optimised for highly concurrent retrieval of large bulk result sets. It is also, as you have noted, not optimised for high update rates.