Context:
- We are running an ES cluster (version 7.9.1) with 3 master nodes and 8 data nodes (AWS r5.2xlarge)
- We are using the default ES thread pool configuration
- The given microservice stores its data in 2 indices with 5 shards each
- Other microservices use the same ES cluster (with their own separate indices)
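For reference, here is a rough sketch of what the default search and write pool sizes work out to on this hardware. It assumes the sizing formulas from the ES 7.x thread pool documentation and that each r5.2xlarge node exposes 8 vCPUs that ES treats as allocated processors; the function and constant names are only illustrative.

```python
def default_pool_sizes(allocated_processors: int) -> dict:
    """Default sizes of the fixed search and write pools on one node (ES 7.x formulas)."""
    return {
        # search: int((allocated processors * 3) / 2) + 1
        "search": (allocated_processors * 3) // 2 + 1,
        # write: one thread per allocated processor
        "write": allocated_processors,
    }


VCPUS_PER_NODE = 8  # assumption: r5.2xlarge has 8 vCPUs, all visible to ES
DATA_NODES = 8      # from the cluster description above

per_node = default_pool_sizes(VCPUS_PER_NODE)
cluster_wide = {name: size * DATA_NODES for name, size in per_node.items()}

print("per node:", per_node)
print("across data nodes:", cluster_wide)
```

These threads are shared by every index on a node, including those of the other microservices.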
Issue:
At peak load, many read and write requests to ES time out, even though the client-side socket timeout is very high (12s)
Cause of Issue: At peak load, many read/write requests are queued waiting for a worker thread from the corresponding thread pool, which causes the latency
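One way to confirm the queueing is to poll the `_cat/thread_pool` API during peak load and look at the `queue` and `rejected` columns per node. The sketch below is a minimal illustration: `ES_URL` is a placeholder, the helper names are hypothetical, and the sample rows are made up to show the shape of the response, not taken from our cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: placeholder, replace with the real endpoint


def fetch_thread_pool_rows(base_url=ES_URL, pools=("search", "write")):
    """Fetch per-node thread pool stats as a list of dicts (cat API, JSON format)."""
    url = (
        f"{base_url}/_cat/thread_pool/{','.join(pools)}"
        "?format=json&h=node_name,name,active,queue,rejected"
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def saturated_pools(rows, queue_threshold=0):
    """Return (node, pool, queue, rejected) for pools that are queueing or rejecting."""
    hits = []
    for row in rows:
        # The cat API returns numeric columns as strings when format=json is used.
        queue = int(row["queue"])
        rejected = int(row["rejected"])
        if queue > queue_threshold or rejected > 0:
            hits.append((row["node_name"], row["name"], queue, rejected))
    # Largest queue first
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Made-up rows, only to illustrate the response shape.
sample_rows = [
    {"node_name": "data-1", "name": "search", "active": "13", "queue": "240", "rejected": "0"},
    {"node_name": "data-1", "name": "write", "active": "2", "queue": "0", "rejected": "0"},
    {"node_name": "data-2", "name": "write", "active": "8", "queue": "35", "rejected": "4"},
]

for node, pool, queue, rejected in saturated_pools(sample_rows):
    print(f"{node} {pool}: queue={queue} rejected={rejected}")
```

Against a live cluster, `saturated_pools(fetch_thread_pool_rows())` would report the same thing. If only a few nodes show deep queues, the load is concentrated on the nodes that hold the busy shards.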
Solutions:
- Increase the number of shards to at least the number of data nodes
- Have a separate cluster for this service so that resources are not shared with other services
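For the first option, the shard count of an existing index cannot be changed in place; it needs either a reindex into a new index or the `_split` API, which only accepts certain target counts. The sketch below lists the counts a 5-shard index could be split into, assuming the index was created with the default `index.number_of_routing_shards` (which, as I understand the 7.x docs, allows splitting by factors of 2 up to a maximum of 1024 shards). The function name is hypothetical.

```python
def valid_split_targets(source_shards, routing_shards=None, limit=1024):
    """List shard counts that an index with `source_shards` primaries can be split into."""
    if routing_shards is None:
        # Assumed default: keep doubling the primary count while staying within the limit.
        routing_shards = source_shards
        while routing_shards * 2 <= limit:
            routing_shards *= 2
    # A target must be a multiple of the source count and a factor of the routing shards.
    return [
        n
        for n in range(source_shards * 2, routing_shards + 1, source_shards)
        if routing_shards % n == 0
    ]


DATA_NODES = 8
targets = valid_split_targets(5)
smallest_fit = min(n for n in targets if n >= DATA_NODES)

print(targets)
print(smallest_fit)
```

Under these assumptions, the smallest split target that covers all 8 data nodes is 10 shards per index. The index has to be made read-only (`index.blocks.write: true`) before it can be split.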
Questions:
- Can we increase the number of search and write threads by reducing the size of other thread pools, e.g. 'sql-write' (which we don't use), and by reducing the core size of the many dynamic pools we don't generally use to 0?
- Any other recommendations?