When relocating/initializing 3 or more shards, ES is overloaded, REST API is very slow, Kibana and es-exporter cannot reach ES

danksim · August 25, 2020, 4:12pm

Hello,

I am running a TLS-enabled Elastic Stack 7.6.2 (currently running on EKS with kubernetes nodes on AWS m5.4xlarge machines 16CPUs 64GB - these are NOT SSD s) and whenever Elasticsearch relocates/initializes more than 3 shards simultaneously, Elasticsearch seems to be "overloaded" (REST APIs are very slow. /_cat/shards took more than a minute) and Kibana/es-exporter fail to retrieve information from Elasticsearch causing Kibana outage and gaps in metrics.

I confirmed this happens when /_cluster/health returns at least 3 of either relocating|initializing_shards:

"relocating_shards" : 3,
"initializing_shards": 0
or
"relocating_shards" : 2,
"initializing_shards":

Kibana logs:

# when relocating|initializing_shards count is 3 or above, Kibana cannot retrieve version infor from es nodes
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["error","savedobjects-service"],"pid":6,"message":"Unable to retrieve version information from Elasticsearch nodes."}
...
# Kibana is unreachable (503 status code)
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["status","plugin:snapshot_restore@7.6.2","error"],"pid":6,"state":"red","message":"Status changed from green to red - Unable to retrieve version information from Elasticsearch nodes.","prevState":"green","prevMsg":"Ready"}
...
# Kibana is unreachable (503 status code)
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["status","plugin:security@7.6.2","info"],"pid":6,"state":"green","message":"Status changed from red to green - Ready","prevState":"red","prevMsg":"Unable to retrieve version information from Elasticsearch nodes."}
...
# Kibana is unreachable (503 status code)
{"type":"response","@timestamp":"2020-08-25T15:46:32Z","tags":[],"pid":6,"method":"get","statusCode":503,"req":{"url":"/app/kibana","method":"get","headers":{"user-agent":"curl/7.29.0","host":"localhost:5601","accept":"*/*"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1"},"res":{"statusCode":503,"responseTime":112,"contentLength":9},"message":"GET /app/kibana 503 112ms - 9.0B"}
...
# once relocating|initializing_shards count goes down, Kibana is back up (200 status code)
{"type":"response","@timestamp":"2020-08-25T15:47:52Z","tags":[],"pid":6,"method":"get","statusCode":200,"req":{"url":"/app/kibana","method":"get","headers":{"user-agent":"curl/7.29.0","host":"localhost:5601","accept":"*/*"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1"},"res":{"statusCode":200,"responseTime":127,"contentLength":9},"message":"GET /app/kibana 200 127ms - 9.0B"}

es-exporter logs:

# when relocating|initializing_shards count is 3 or above, es-exporter cannot fetch metrics
level=warn ts=2020-08-25T15:51:25.15613212Z caller=indices.go:1061 msg="failed to fetch and decode index stats" err="failed to get index stats from https://my-elastic-stack-coordinator:9200/_all/_stats: Get https://elastic:***@my-elastic-stack-coordinator:9200/_all/_stats: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

Cluster setup:

coordinator:
  replicas: 9
  jvm heap: 8gb
  cpu: 3
  memory 16gb
master:
  replicas: 3
  jvm heap: 4gb
  cpu: 2
  memory 8gb
data:
  replicas: 12
  jvm heap: 8gb
  cpu: 6
  memory 16gb

Average usages:

cpu: 10% across all data nodes
jvm heap: 46% across all data nodes

Numbers for indices/shards:

538 indices total - 6 shards for hot, 2 shards for warm, 1 replica for all
4764 shards total (2382 primary)

Some relevant cluster/indices settings:

cluster.routing.allocation.cluster_concurrent_rebalance: 2

indices.recovery.max_bytes_per_sec: 1028mb (tried 15mb, default 40mb, 200mb as well)
indices.recovery.max_concurrent_file_chunks: 5 (tried default 2 as well)

all throttle_times are zeros for /<index-name-of-relocating|initializing-shard>/_recovery:

"source_throttle_time_in_millis" : 0,
"target_throttle_time_in_millis" : 0

Also, I saw this thread but transport.tcp.compress was set to false by default

What else can I do to enhance performance so that it can relocate/initialize more shards without getting overloaded?

DavidTurner · August 25, 2020, 6:07pm

This is a terrible idea. Spinning disks won't be able to handle this kind of recovery rate and will fall over in their attempts to achieve it.

That's also a bad sign, you want recoveries to be throttled, otherwise they will consume all available resources and leave nothing for the rest of the cluster. Set indices.recovery.max_bytes_per_sec low enough that your recoveries are seeing appreciable throttling.

By default each node will only involve itself in two recoveries at once, and this default is a good one. Don't increase it. Especially if you only have spinning disks to work with, they don't like concurrent workloads at all.

danksim · August 27, 2020, 11:06am

I was experimenting. I have tried 15mb, default 40mb, 200mb as well. Will set it back to default.

Ok. I won't increase it. What else do you recommend I try now?

DavidTurner · August 27, 2020, 1:00pm

This is the best thing to try:

Look at how fast your recoveries can actually go and then set the limit a bit lower so they're not completely saturating your disks and the throttle is doing its job.

danksim · August 28, 2020, 3:11pm

Thanks, @DavidTurner, I am seeing lots of decision: THROTTLED in /explain for shards after lowering the value for indices.recovery.max_bytes_per_sec and it seems to have resolved the overloaded cluster.

DavidTurner · August 28, 2020, 3:52pm

I'm glad to hear your problem is resolved, but TBC the THROTTLED you are seeing in the cluster allocation explain output is referring to a completely different throttle from the one that indices.recovery.max_bytes_per_sec affects. You should be looking at the {source,target}_throttle_time_in_millis values from the recovery API instead.

system · September 25, 2020, 3:52pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Large amounts of shards failing Elasticsearch	6	2073	July 5, 2017
Shards stuck in Initializing mode Elasticsearch	3	3961	July 5, 2017
Elasticsearch continually fails after upgrade Elasticsearch	8	1079	July 11, 2019
Discover: An error occurred with your request. Reset your inputs and try again Kibana	6	3290	July 6, 2017
ES instance restart causing shard initialization Elasticsearch	7	2077	July 5, 2017

When relocating/initializing 3 or more shards, ES is overloaded, REST API is very slow, Kibana and es-exporter cannot reach ES

Related topics