Index Dimensioning and Optimization (across the Cluster)

wissam · February 22, 2021, 8:30am

Hello,

We have an Elasticsearch (v7.5.3) cluster with 4 nodes (each is a VM with 16 CPUs, 32 GB of RAM and JVM heap size of 16GB) and all configured as data nodes.

Our main index has around 110M records (~70 GB) that we initially configured with 4 shards and 3 replicas.

At first, after filling the index, we ran our preliminary tests and everything looked fine: queries were pretty fast (we're mainly interested in aggregations and scroll downloading). But after we started the scripts that do the inserts and (mainly) updates, at around 100 bulk requests per minute, the queries started lagging and almost always timed out.

We tried re-indexing with different configurations, such as 4 shards/1 replica and 24 shards/1 replica, but they seemed to perform even worse than the initial configuration.

Any suggestions please on what can be done to optimize this setup?

Thanks
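
For reference, the three shard/replica layouts mentioned above can be compared with some simple arithmetic. This is only a rough sketch: it assumes the ~70 GB figure is the size of the primary data, that shard copies are spread evenly over the 4 nodes, and that on-disk size scales linearly with the number of copies. The `Layout` class and `summarize` helper are hypothetical names used for illustration.

```python
from dataclasses import dataclass

NODES = 4             # data nodes in the cluster (from the post)
PRIMARY_SIZE_GB = 70  # assumed to be the size of the primary data only


@dataclass
class Layout:
    shards: int    # number of primary shards
    replicas: int  # number of replica copies per primary

    @property
    def shard_copies(self) -> int:
        # Each primary shard exists once plus once per replica.
        return self.shards * (1 + self.replicas)

    @property
    def writes_per_document(self) -> int:
        # An insert or update is applied on the primary and on every replica.
        return 1 + self.replicas

    @property
    def total_size_gb(self) -> int:
        # Assumption: replicas take as much disk space as the primaries.
        return PRIMARY_SIZE_GB * (1 + self.replicas)


def summarize(layout: Layout, nodes: int = NODES) -> dict:
    """Return per-cluster and per-node figures for one layout."""
    return {
        "shard_copies": layout.shard_copies,
        "copies_per_node": layout.shard_copies / nodes,
        "writes_per_document": layout.writes_per_document,
        "data_per_node_gb": layout.total_size_gb / nodes,
    }


if __name__ == "__main__":
    for layout in (Layout(4, 3), Layout(4, 1), Layout(24, 1)):
        print(layout, summarize(layout))
```

Under these assumptions, the initial 4 shards/3 replicas layout puts a full copy of the index on every node and applies each update four times across the cluster, which is worth keeping in mind when looking at the disk metrics.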

Christian_Dahlqvist · February 22, 2021, 8:37am

When you are running bulk updates/inserts and queries slow down, have you tried to identify what is limiting performance? Is CPU saturated? Are you seeing long or frequent GC? What do disk I/O and iowait look like? What type of storage do you have?

wissam · February 23, 2021, 10:32am

Thank you Christian for your prompt reply.

I monitored the system for some time (while I was getting the timeouts) but I couldn't identify where the bottleneck might be. I'll share some of the gathered metrics below:

iostat

cpu_avg (%user):
Server , Avg , Max , Min , StDev
Server1, 3.642, 13.48, 0.06, 2.985
Server2, 2.266 , 13.86 , 0 , 2.591
Server3, 2.812 , 24.34 , 0 , 3.338
Server4, 4.482 , 22.52 , 0 , 4.55

cpu_avg (%iowait):
Server , Avg , Max , Min , StDev
Server1, 17.757 , 55.63 , 0 , 10.494
Server2, 7.348 , 38.1 , 0 , 7.646
Server3, 8.797 , 32.5 , 0 , 7.041
Server4, 19.083 , 44.4 , 0 , 8.72

tps:
Server , Avg , Max , Min , StDev
Server1, 293 , 1524 , 0 , 230
Server2, 147 , 1463 , 0 , 249
Server3, 139 , 908 , 0 , 183
Server4, 70 , 763 , 0 , 96

kB_read/s:
Server , Avg , Max , Min , StDev
Server1, 4260 , 39156 , 0 , 5291
Server2, 3148 , 28136 , 0 , 4281
Server3, 3145 , 28828 , 0 , 4203
Server4, 4112 , 21752 , 0 , 3481

kB_wrtn/s:
Server , Avg , Max , Min , StDev
Server1, 1129 , 19309 , 0 , 2263
Server2, 690 , 13788 , 0 , 1755
Server3, 1104 , 127374 , 0 , 7605
Server4, 2020 , 60912 , 0 , 6775

GC Analysis :

Server1:
JVM memory size:
  Generation , Allocated , Peak
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 2.08 gb
  Meta Space , 1.09 gb , 93.64 mb
  Young + Old + Meta space , 16.98 gb , 3.11 gb

Throughput  : 99.603%
Latency:
  Avg Pause GC Time: 32.0 ms
  Max Pause GC Time: 60.0 ms

GC Pause Duration Time Range :
  Duration (ms),No. of GCs,Percentage
  0 - 10 , 2 , 0.96%
  10 - 20 , 13 , 6.25%
  20 - 30 , 139 , 66.83%
  30 - 40 , 49 , 23.56%
  50 - 60 , 3 , 1.44%
  60 - 70 , 2 , 0.96%

Server2:
JVM memory size:
  Generation , Allocated , Peak
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 1.72 gb
  Meta Space , 1.08 gb , 92.22 mb
  Young + Old + Meta space , 16.98 gb , 2.73 gb

Throughput  : 99.797%
Latency:
  Avg Pause GC Time: 35.2 ms
  Max Pause GC Time: 190 ms

GC Pause Duration Time Range :
  Duration (ms),No. of GCs,Percentage
  0 - 100 , 86 , 98.85%
  100 - 200 , 1 , 1.15%

Server3:
JVM memory size:
  Generation , Allocated , Peak
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 3.01 gb
  Meta Space , 1.08 gb , 92.42 mb
  Young + Old + Meta space , 16.98 gb , 4.01 gb
  
Throughput  : 99.637%
Latency:
  Avg Pause GC Time: 35.2 ms
  Max Pause GC Time: 120 ms

GC Pause Duration Time Range :
  Duration (ms) , No. of GCs , Percentage
  0 - 100 , 152 , 99.35%
  100 - 200 , 1 , 0.65%

Server4:
JVM memory size:
  Generation , Allocated , Peak 
  Young Generation , 973.44 mb , 973.44 mb
  Old Generation , 14.94 gb , 4.23 gb
  Meta Space , 1.09 gb , 94.99 mb
  Young + Old + Meta space , 16.98 gb , 5.28 gb

Throughput  : 99.537%
Latency:
  Avg Pause GC Time: 36.1 ms
  Max Pause GC Time: 70.0 ms

GC Pause Duration Time Range :
  Duration (ms), No. of GCs , Percentage
  0 - 10 , 2 , 1.02%
  10 - 20 , 15 , 7.61%
  20 - 30 , 82 , 41.62%
  30 - 40 , 68 , 34.52%
  50 - 60 , 20 , 10.15%
  60 - 70 , 8 , 4.06%
  70 - 80 , 2 , 1.02%

Christian_Dahlqvist · February 24, 2021, 10:21am

That is quite high iowait. What type of storage are you using?
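
To make the iowait observation concrete, here is a small sketch that reads the `%iowait` averages posted above and lists the servers above a cut-off. The 10% threshold is an arbitrary value picked for illustration, not an official Elasticsearch or Linux guideline, and `servers_above` is a hypothetical helper.

```python
import csv
import io

# %iowait figures copied from the iostat summary posted above.
IOWAIT_TABLE = """\
Server,Avg,Max,Min,StDev
Server1,17.757,55.63,0,10.494
Server2,7.348,38.1,0,7.646
Server3,8.797,32.5,0,7.041
Server4,19.083,44.4,0,8.72
"""

THRESHOLD_PCT = 10.0  # hypothetical cut-off, chosen only for this example


def servers_above(table: str, threshold: float = THRESHOLD_PCT) -> list:
    """Return (server, average iowait) pairs whose average exceeds the threshold."""
    rows = csv.DictReader(io.StringIO(table))
    return [
        (row["Server"], float(row["Avg"]))
        for row in rows
        if float(row["Avg"]) > threshold
    ]


if __name__ == "__main__":
    for server, avg in servers_above(IOWAIT_TABLE):
        print(f"{server}: average iowait {avg:.1f}%")
```

With the posted numbers, Server1 and Server4 are the two nodes whose average iowait exceeds this cut-off.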

wissam · February 24, 2021, 1:45pm

Hard Disk Drives

Christian_Dahlqvist · February 24, 2021, 2:00pm

First I would recommend you upgrade to the latest version, as I believe there have been some improvements to update handling. Reducing the number of replicas should also help if you are having I/O issues.
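
As a rough sketch of the replica suggestion, the number of replicas can be changed on a live index through the update index settings API (`PUT /<index>/_settings` with `index.number_of_replicas`). The example below only builds the request and does not send it. The URL `http://localhost:9200`, the index name `main-index`, and the target value of 1 replica are placeholders, not values from this thread.

```python
import json
import urllib.request


def build_replica_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a request that sets index.number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder host, index name and replica count; adjust to the real cluster.
    req = build_replica_request("http://localhost:9200", "main-index", 1)
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
    # To actually apply the change, the request would be sent with:
    # urllib.request.urlopen(req)
```

Fewer replicas means each update is written to fewer shard copies, at the cost of less redundancy and fewer copies available to serve searches, so the right value depends on how much of each the cluster can give up.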

system · March 24, 2021, 2:00pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
