Index Dimensioning and Optimization (across the Cluster)

Hello,

We have an Elasticsearch (v7.5.3) cluster with 4 nodes (each is a VM with 16 CPUs, 32 GB of RAM and JVM heap size of 16GB) and all configured as data nodes.

Our main index has around 110M records (~70 GB) that we initially configured with 4 shards and 3 replicas.
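For reference, here is a rough sketch of what that layout implies per node. It assumes the ~70 GB figure is the size of the primaries only (the post does not say) and that shard copies are spread evenly across the 4 nodes; `shard_layout` is a hypothetical helper, not an Elasticsearch API.

```python
def shard_layout(index_gb, primaries, replicas, nodes):
    """Back-of-the-envelope footprint for one index (even distribution assumed)."""
    copies = 1 + replicas              # each primary plus its replicas
    total_shards = primaries * copies  # shard copies in the whole cluster
    total_gb = index_gb * copies       # data on disk across the cluster
    return {
        "shard_size_gb": index_gb / primaries,
        "total_shard_copies": total_shards,
        "total_gb": total_gb,
        "gb_per_node": total_gb / nodes,
        "shards_per_node": total_shards / nodes,
        # every insert/update has to be indexed on each copy of its shard
        "writes_per_document": copies,
    }


# Assumption: 70 GB is the primary-only size of the index.
layout = shard_layout(index_gb=70, primaries=4, replicas=3, nodes=4)
for key, value in layout.items():
    print(f"{key}: {value}")
```

Under those assumptions, every node holds a full copy of the index (about 70 GB) and each update is indexed four times across the cluster, while only about 16 GB of RAM per node is left outside the heap for the filesystem cache.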

At first, after filling the index, we ran our preliminary tests and everything looked fine: queries were fast (we're mainly interested in aggregations and scroll downloads). But once we started the scripts that do the inserts and (mainly) updates, at around 100 bulk requests per minute, queries started lagging and almost always timed out.

We tried re-indexing with different configurations, such as 4 shards/1 replica and 24 shards/1 replica, but they seemed to perform even worse than the initial configuration.

Any suggestions please on what can be done to optimize this setup?

Thanks

When you are running bulk updates/inserts and queries slow down, have you tried to identify what is limiting performance? Is CPU saturated? Are you seeing long or frequent GC? What do disk I/O and iowait look like? What type of storage do you have?

Thank you Christian for your prompt reply.

I monitored the system for some time (while I was getting the timeouts) but I couldn't identify where the bottleneck might be. I'll share some of the gathered metrics below:

iostat

cpu_avg (%user):
Server , Avg , Max , Min , StDev
Server1, 3.642, 13.48, 0.06, 2.985
Server2, 2.266 , 13.86 , 0 , 2.591
Server3, 2.812 , 24.34 , 0 , 3.338
Server4, 4.482 , 22.52 , 0 , 4.55

cpu_avg (%iowait):
Server , Avg , Max , Min , StDev
Server1, 17.757 , 55.63 , 0 , 10.494
Server2, 7.348 , 38.1 , 0 , 7.646
Server3, 8.797 , 32.5 , 0 , 7.041
Server4, 19.083 , 44.4 , 0 , 8.72

tps:
Server , Avg , Max , Min , StDev
Server1, 293 , 1524 , 0 , 230
Server2, 147 , 1463 , 0 , 249
Server3, 139 , 908 , 0 , 183
Server4, 70 , 763 , 0 , 96

kB_read/s:
Server , Avg , Max , Min , StDev
Server1, 4260 , 39156 , 0 , 5291
Server2, 3148 , 28136 , 0 , 4281
Server3, 3145 , 28828 , 0 , 4203
Server4, 4112 , 21752 , 0 , 3481

kB_wrtn/s:
Server , Avg , Max , Min , StDev
Server1, 1129 , 19309 , 0 , 2263
Server2, 690 , 13788 , 0 , 1755
Server3, 1104 , 127374 , 0 , 7605
Server4, 2020 , 60912 , 0 , 6775

GC Analysis :

Server1:
JVM memory size:
  Generation , Allocated , Peak
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 2.08 gb
  Meta Space , 1.09 gb , 93.64 mb
  Young + Old + Meta space , 16.98 gb , 3.11 gb

Throughput  : 99.603%
Latency:
  Avg Pause GC Time: 32.0 ms
  Max Pause GC Time: 60.0 ms

GC Pause Duration Time Range :
  Duration (ms),No. of GCs,Percentage
  0 - 10 , 2 , 0.96%
  10 - 20 , 13 , 6.25%
  20 - 30 , 139 , 66.83%
  30 - 40 , 49 , 23.56%
  50 - 60 , 3 , 1.44%
  60 - 70 , 2 , 0.96%

Server2:
JVM memory size:
  Generation , Allocated , Peak
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 1.72 gb
  Meta Space , 1.08 gb , 92.22 mb
  Young + Old + Meta space , 16.98 gb , 2.73 gb

Throughput  : 99.797%
Latency:
  Avg Pause GC Time: 35.2 ms
  Max Pause GC Time: 190 ms

GC Pause Duration Time Range :
  Duration (ms),No. of GCs,Percentage
  0 - 100 , 86 , 98.85%
  100 - 200 , 1 , 1.15%

Server3:
JVM memory size:
  Generation , Allocated , Peak
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 3.01 gb
  Meta Space , 1.08 gb , 92.42 mb
  Young + Old + Meta space , 16.98 gb , 4.01 gb
  
Throughput  : 99.637%
Latency:
  Avg Pause GC Time: 35.2 ms
  Max Pause GC Time: 120 ms

GC Pause Duration Time Range :
  Duration (ms) , No. of GCs , Percentage
  0 - 100 , 152 , 99.35%
  100 - 200 , 1 , 0.65%

Server4:
JVM memory size:
  Generation , Allocated , Peak 
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 4.23 gb
  Meta Space , 1.09 gb , 94.99 mb
  Young + Old + Meta space , 16.98 gb , 5.28 gb

Throughput  : 99.537%
Latency:
  Avg Pause GC Time: 36.1 ms
  Max Pause GC Time: 70.0 ms

GC Pause Duration Time Range :
  Duration (ms), No. of GCs , Percentage
  0 - 10 , 2 , 1.02%
  10 - 20 , 15 , 7.61%
  20 - 30 , 82 , 41.62%
  30 - 40 , 68 , 34.52%
  50 - 60 , 20 , 10.15%
  60 - 70 , 8 , 4.06%
  70 - 80 , 2 , 1.02%

That is quite high iowait. What type of storage are you using?

Hard Disk Drives

First, I would recommend upgrading to the latest version, as I believe there have been some improvements to update handling. Reducing the number of replicas should also help if you are having I/O issues, since every update has to be written to each replica as well as the primary.
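As a minimal sketch of the replica change, the number of replicas can be lowered on a live index through the update index settings API (`PUT /<index>/_settings`). The host `http://localhost:9200` and the index name `main-index` below are placeholders, not values from this thread, and the request is only built here, not sent.

```python
import json
import urllib.request


def replica_settings_request(base_url, index, replicas):
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and index name; replace with your own.
req = replica_settings_request("http://localhost:9200", "main-index", 1)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually apply the change against a reachable cluster:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Going from 3 replicas to 1 would halve the number of shard copies each update has to be written to (from 4 to 2), at the cost of less redundancy and fewer copies available to serve searches.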