Hello,
I am running a TLS-enabled Elastic Stack 7.6.2 on EKS, with Kubernetes nodes on AWS m5.4xlarge machines (16 CPUs, 64GB RAM - these are NOT SSDs). Whenever Elasticsearch relocates/initializes more than 3 shards simultaneously, Elasticsearch seems to be "overloaded": REST APIs are very slow (/_cat/shards took more than a minute), and Kibana/es-exporter fail to retrieve information from Elasticsearch, causing a Kibana outage and gaps in metrics.
I confirmed this happens whenever /_cluster/health reports a combined total of at least 3 across relocating_shards and initializing_shards:
"relocating_shards" : 3,
"initializing_shards": 0
or
"relocating_shards" : 2,
"initializing_shards":
Kibana logs:
# when relocating|initializing_shards count is 3 or above, Kibana cannot retrieve version info from es nodes
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["error","savedobjects-service"],"pid":6,"message":"Unable to retrieve version information from Elasticsearch nodes."}
...
# Kibana is unreachable (503 status code)
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["status","plugin:snapshot_restore@7.6.2","error"],"pid":6,"state":"red","message":"Status changed from green to red - Unable to retrieve version information from Elasticsearch nodes.","prevState":"green","prevMsg":"Ready"}
...
# Kibana is unreachable (503 status code)
{"type":"log","@timestamp":"2020-08-25T15:42:45Z","tags":["status","plugin:security@7.6.2","info"],"pid":6,"state":"green","message":"Status changed from red to green - Ready","prevState":"red","prevMsg":"Unable to retrieve version information from Elasticsearch nodes."}
...
# Kibana is unreachable (503 status code)
{"type":"response","@timestamp":"2020-08-25T15:46:32Z","tags":[],"pid":6,"method":"get","statusCode":503,"req":{"url":"/app/kibana","method":"get","headers":{"user-agent":"curl/7.29.0","host":"localhost:5601","accept":"*/*"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1"},"res":{"statusCode":503,"responseTime":112,"contentLength":9},"message":"GET /app/kibana 503 112ms - 9.0B"}
...
# once relocating|initializing_shards count goes down, Kibana is back up (200 status code)
{"type":"response","@timestamp":"2020-08-25T15:47:52Z","tags":[],"pid":6,"method":"get","statusCode":200,"req":{"url":"/app/kibana","method":"get","headers":{"user-agent":"curl/7.29.0","host":"localhost:5601","accept":"*/*"},"remoteAddress":"127.0.0.1","userAgent":"127.0.0.1"},"res":{"statusCode":200,"responseTime":127,"contentLength":9},"message":"GET /app/kibana 200 127ms - 9.0B"}
es-exporter logs:
# when relocating|initializing_shards count is 3 or above, es-exporter cannot fetch metrics
level=warn ts=2020-08-25T15:51:25.15613212Z caller=indices.go:1061 msg="failed to fetch and decode index stats" err="failed to get index stats from https://my-elastic-stack-coordinator:9200/_all/_stats: Get https://elastic:***@my-elastic-stack-coordinator:9200/_all/_stats: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Cluster setup:
coordinator:
  replicas: 9
  jvm heap: 8gb
  cpu: 3
  memory: 16gb
master:
  replicas: 3
  jvm heap: 4gb
  cpu: 2
  memory: 8gb
data:
  replicas: 12
  jvm heap: 8gb
  cpu: 6
  memory: 16gb
Average usage:
cpu: 10% across all data nodes
jvm heap: 46% across all data nodes
Numbers for indices/shards:
538 indices total - 6 shards for hot, 2 shards for warm, 1 replica for all
4764 shards total (2382 primary)
Some relevant cluster/indices settings:
cluster.routing.allocation.cluster_concurrent_rebalance: 2
indices.recovery.max_bytes_per_sec: 1028mb (tried 15mb, default 40mb, 200mb as well)
indices.recovery.max_concurrent_file_chunks: 5 (tried default 2 as well)
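For reference, the settings above were applied as transient cluster settings. A sketch of the request body (the endpoint and credentials are omitted; this is the body PUT to /_cluster/settings):

```python
import json

# Transient cluster settings, with the values listed above.
# This body would be PUT to /_cluster/settings on the cluster.
settings_body = {
    "transient": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
        "indices.recovery.max_bytes_per_sec": "1028mb",
        "indices.recovery.max_concurrent_file_chunks": 5,
    }
}
print(json.dumps(settings_body, indent=2))
```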
All throttle_times are zero in /<index-name-of-relocating|initializing-shard>/_recovery:
"source_throttle_time_in_millis" : 0,
"target_throttle_time_in_millis" : 0
Also, I saw this thread, but transport.tcp.compress was already set to false (the default).
What else can I do to improve performance so that the cluster can relocate/initialize more shards without getting overloaded?