ElasticSearch Read/Write performance issue on Peak Load

Context:

  1. We are running an ES cluster (ES version 7.9.1) with 3 master nodes and 8 data nodes (AWS r5.2xlarge)
  2. We are using the default ES thread pool configuration
  3. Our data is stored in 2 indices (used by the given microservice), each with 5 shards
  4. Other microservices use the same ES cluster (with their own, separate indices)

Issue:
At peak load, many read and write requests to ES time out (even though the client-side socket timeout is quite high, at 12 s)

Cause of issue: At peak load, many read/write requests get queued up waiting for a worker thread from the corresponding thread pool, hence the latency.
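
One way to confirm the queueing hypothesis is to watch the search and write thread pools for queued and rejected tasks while the peak load is on. A minimal sketch using the _cat/thread_pool API, assuming the cluster is reachable at a placeholder endpoint:

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder, point this at your cluster

# _cat/thread_pool reports per-node active, queued and rejected task counts;
# a growing "queue" and non-zero "rejected" on search/write confirm saturation.
resp = requests.get(
    f"{ES_URL}/_cat/thread_pool/search,write",
    params={"h": "node_name,name,active,queue,rejected", "format": "json"},
)
resp.raise_for_status()

for row in resp.json():
    print(f"{row['node_name']:>20} {row['name']:<7} "
          f"active={row['active']} queue={row['queue']} rejected={row['rejected']}")
```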

Solutions:

  1. Increase the number of shards to at least the number of data nodes (see the split sketch after this list)
  2. Set up a separate cluster for this service so that resources are not shared with other services
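
On solution 1, note that the primary shard count of an existing index cannot be changed in place; it takes either a reindex into a new index created with more primaries, or the _split API, which requires the source index to be write-blocked first and a target shard count that is a multiple of the source's. A minimal sketch of the split route, with hypothetical index names:

```python
import requests

ES_URL = "http://localhost:9200"                  # placeholder
SOURCE, TARGET = "my-index", "my-index-10shards"  # hypothetical index names

# 1. Block writes on the source index (a precondition of the split API).
requests.put(f"{ES_URL}/{SOURCE}/_settings",
             json={"index.blocks.write": True}).raise_for_status()

# 2. Split 5 primaries into 10 (the target count must be a multiple of 5).
requests.post(f"{ES_URL}/{SOURCE}/_split/{TARGET}",
              json={"settings": {"index.number_of_shards": 10}}).raise_for_status()

# 3. Once the target index is green, lift the write block on it and switch
#    the application (or an alias) over to the new index.
requests.put(f"{ES_URL}/{TARGET}/_settings",
             json={"index.blocks.write": None}).raise_for_status()
```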

Questions:

  1. Can we increase the number of search and write threads by reducing the size of some other thread pools, e.g. 'sql-write' (which we don't use), and by reducing the core size of many dynamic pools that we rarely use to 0? (See the configuration sketch after this list.)

  2. Any other recommendations?
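
On question 1: in 7.x the search and write pools are fixed-size pools derived from the node's allocated processors, and their size/queue_size can only be overridden statically in elasticsearch.yml (node restart required). Shrinking pools you don't use frees threads, but not necessarily the CPU that search and write are actually contending for, so it is worth dumping the current per-node pool settings before changing anything. A minimal sketch, same placeholder endpoint:

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder

# Nodes info with the thread_pool metric shows each pool's type, size and
# queue_size as currently configured on every node.
resp = requests.get(f"{ES_URL}/_nodes/thread_pool")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    pools = node["thread_pool"]
    print(node.get("name", node_id),
          "search:", pools.get("search"),
          "write:", pools.get("write"))

# Overrides would then go into elasticsearch.yml on each data node, e.g.
# thread_pool.write.queue_size / thread_pool.search.queue_size (static settings).
```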

First, try to identify what the bottleneck is. Indexing is quite I/O intensive, so I would first look at storage performance and iowait. What type of storage are you using, and what is the load on it?
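
For the storage side of this, the node stats API exposes per-node disk I/O counters (on Linux) that can be sampled during peak load, alongside host-level tools such as iostat -x for utilization and iowait. A minimal sketch, same placeholder endpoint:

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder

# /_nodes/stats/fs returns filesystem stats per node, including an io_stats
# section (Linux only) with cumulative read/write counters for the data paths.
resp = requests.get(f"{ES_URL}/_nodes/stats/fs")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    io_totals = node.get("fs", {}).get("io_stats", {}).get("total", {})
    print(node.get("name", node_id), io_totals)
```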

How many indices and shards are you actively indexing into (in the cluster as a whole)?
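
A quick way to answer this is to list the shard layout of the whole cluster, e.g. how many shards exist in total, how many primaries each index has, and how they are spread over the data nodes. A minimal sketch with the _cat/shards API, same placeholder endpoint:

```python
from collections import Counter

import requests

ES_URL = "http://localhost:9200"  # placeholder

resp = requests.get(f"{ES_URL}/_cat/shards",
                    params={"h": "index,shard,prirep,node", "format": "json"})
resp.raise_for_status()
shards = resp.json()

print("total shards:", len(shards))
print("primaries per index:",
      Counter(s["index"] for s in shards if s["prirep"] == "p"))
print("shards per data node:", Counter(s["node"] for s in shards))
```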

What is the load (queries and writes per second) at which you start seeing this performance degradation?
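
The cluster-wide search and indexing rates can be estimated by sampling the cumulative counters in the index stats API twice and dividing the deltas by the interval. A minimal sketch, same placeholder endpoint:

```python
import time

import requests

ES_URL = "http://localhost:9200"  # placeholder
INTERVAL = 60                     # seconds between the two samples

def totals():
    # _all/total aggregates the cumulative query and index counters across
    # every index in the cluster.
    stats = requests.get(f"{ES_URL}/_stats/search,indexing").json()
    total = stats["_all"]["total"]
    return total["search"]["query_total"], total["indexing"]["index_total"]

q1, w1 = totals()
time.sleep(INTERVAL)
q2, w2 = totals()

print(f"search rate ~ {(q2 - q1) / INTERVAL:.1f} qps")
print(f"index rate  ~ {(w2 - w1) / INTERVAL:.1f} wps")
```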

The storage type is Elastic Block Storage (EBS), and the EBS volume type is General Purpose SSD.
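
One thing to keep in mind, assuming these are gp2 volumes (the post only says General Purpose SSD): gp2 baseline IOPS scale with volume size at 3 IOPS per GiB (minimum 100, maximum 16,000), with burst up to 3,000 IOPS for smaller volumes, so a sustained peak that exceeds the baseline will throttle once burst credits are exhausted. A small illustrative calculation with a hypothetical volume size:

```python
def gp2_baseline_iops(size_gib: int) -> int:
    # gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000.
    return min(max(3 * size_gib, 100), 16_000)

# Hypothetical 500 GiB data volume: 1,500 IOPS baseline, burstable to ~3,000.
print(gp2_baseline_iops(500))  # -> 1500
```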

EBS metrics for the past 2 weeks (screenshots attached): read latency, write latency, read throughput, write throughput, disk queue depth, read IOPS, write IOPS.
