When relocating/initializing 3 or more shards, ES is overloaded, REST API is very slow, Kibana and es-exporter cannot reach ES

Hello,

I am running a TLS-enabled Elastic Stack 7.6.2 (currently running on EKS with kubernetes nodes on AWS m5.4xlarge machines 16CPUs 64GB - these are NOT SSD s) and whenever Elasticsearch relocates/initializes more than 3 shards simultaneously, Elasticsearch seems to be "overloaded" (REST APIs are very slow. /_cat/shards took more than a minute) and Kibana/es-exporter fail to retrieve information from Elasticsearch causing Kibana outage and gaps in metrics.

I confirmed this happens when /_cluster/health returns at least 3 of either relocating|initializing_shards:

"relocating_shards" : 3,
"initializing_shards": 0
or
"relocating_shards" : 2,
"initializing_shards":

Kibana logs:

# when relocating|initializing_shards count is 3 or above, Kibana cannot retrieve version infor from es nodes
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["error","savedobjects-service"],"pid":6,"message":"Unable to retrieve version information from Elasticsearch nodes."}
...
# Kibana is unreachable (503 status code)
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["status","plugin:snapshot_restore@7.6.2","error"],"pid":6,"state":"red","message":"Status changed from green to red - Unable to retrieve version information from Elasticsearch nodes.","prevState":"green","prevMsg":"Ready"}
...
# Kibana is unreachable (503 status code)
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["status","plugin:security@7.6.2","info"],"pid":6,"state":"green","message":"Status changed from red to green - Ready","prevState":"red","prevMsg":"Unable to retrieve version information from Elasticsearch nodes."}
...
# Kibana is unreachable (503 status code)
{"type":"response","@timestamp":"2020-08-25T15:46:32Z","tags":[],"pid":6,"method":"get","statusCode":503,"req":{"url":"/app/kibana","method":"get","headers":{"user-agent":"curl/7.29.0","host":"localhost:5601","accept":"*/*"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1"},"res":{"statusCode":503,"responseTime":112,"contentLength":9},"message":"GET /app/kibana 503 112ms - 9.0B"}
...
# once relocating|initializing_shards count goes down, Kibana is back up (200 status code)
{"type":"response","@timestamp":"2020-08-25T15:47:52Z","tags":[],"pid":6,"method":"get","statusCode":200,"req":{"url":"/app/kibana","method":"get","headers":{"user-agent":"curl/7.29.0","host":"localhost:5601","accept":"*/*"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1"},"res":{"statusCode":200,"responseTime":127,"contentLength":9},"message":"GET /app/kibana 200 127ms - 9.0B"}

es-exporter logs:

# when relocating|initializing_shards count is 3 or above, es-exporter cannot fetch metrics
level=warn ts=2020-08-25T15:51:25.15613212Z caller=indices.go:1061 msg="failed to fetch and decode index stats" err="failed to get index stats from https://my-elastic-stack-coordinator:9200/_all/_stats: Get https://elastic:***@my-elastic-stack-coordinator:9200/_all/_stats: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

Cluster setup:

coordinator:
  replicas: 9
  jvm heap: 8gb
  cpu: 3
  memory 16gb
master:
  replicas: 3
  jvm heap: 4gb
  cpu: 2
  memory 8gb
data:
  replicas: 12
  jvm heap: 8gb
  cpu: 6
  memory 16gb

Average usages:

cpu: 10% across all data nodes
jvm heap: 46% across all data nodes

Numbers for indices/shards:

538 indices total - 6 shards for hot, 2 shards for warm, 1 replica for all
4764 shards total (2382 primary)

Some relevant cluster/indices settings:

cluster.routing.allocation.cluster_concurrent_rebalance: 2

indices.recovery.max_bytes_per_sec: 1028mb (tried 15mb, default 40mb, 200mb as well)
indices.recovery.max_concurrent_file_chunks: 5 (tried default 2 as well)

all throttle_times are zeros for /<index-name-of-relocating|initializing-shard>/_recovery:

"source_throttle_time_in_millis" : 0,
"target_throttle_time_in_millis" : 0

Also, I saw this thread but transport.tcp.compress was set to false by default

What else can I do to enhance performance so that it can relocate/initialize more shards without getting overloaded?

This is a terrible idea. Spinning disks won't be able to handle this kind of recovery rate and will fall over in their attempts to achieve it.

That's also a bad sign, you want recoveries to be throttled, otherwise they will consume all available resources and leave nothing for the rest of the cluster. Set indices.recovery.max_bytes_per_sec low enough that your recoveries are seeing appreciable throttling.

By default each node will only involve itself in two recoveries at once, and this default is a good one. Don't increase it. Especially if you only have spinning disks to work with, they don't like concurrent workloads at all.

I was experimenting. I have tried 15mb, default 40mb, 200mb as well. Will set it back to default.

Ok. I won't increase it. What else do you recommend I try now?

This is the best thing to try:

Look at how fast your recoveries can actually go and then set the limit a bit lower so they're not completely saturating your disks and the throttle is doing its job.

Thanks, @DavidTurner, I am seeing lots of decision: THROTTLED in /explain for shards after lowering the value for indices.recovery.max_bytes_per_sec and it seems to have resolved the overloaded cluster.

I'm glad to hear your problem is resolved, but TBC the THROTTLED you are seeing in the cluster allocation explain output is referring to a completely different throttle from the one that indices.recovery.max_bytes_per_sec affects. You should be looking at the {source,target}_throttle_time_in_millis values from the recovery API instead.

1 Like