Cluster has become unresponsive

Our ES service has suddenly become unresponsive.
HEAD is unable to connect to it. I can run cluster health against it, but any stats request just keeps spinning and ES never returns results.
Can anyone help diagnose the problem?

Are you still on 2.3.1 or 1.3.2, or on something newer?

I'm only seeing timeouts, with no indication of a root cause. Is this group only for those versions? We're running 6.4.1.

Please share some logs; it's hard to troubleshoot without seeing anything.

This is what all the timeouts look like:

[2019-01-23T16:55:38,324][WARN ][o.e.t.TransportService ] [uOeCjC6] Received response for a request that has timed out, sent [28687ms] ago, timed out [13687ms] ago, action [cluster:monitor/nodes/stats[n]], node [{9bcbJqh}{9bcbJqhRSbGJKbarmYDaYA}{b8lD9IQYQ_CrVkOaLD7BBA}{10.0.2.89}{10.0.2.89:9300}{aws_availability_zone=eu-west-1c, ml.machine_memory=32620150784, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], id [41894023]

I think it might have been caused by a high-cardinality query returning a very large number of unique items.
Regardless, if I query cluster health it returns green, but if I try to execute node stats or any other node operation, it just doesn't return. How do I get the cluster back to normal? PS: the cluster is a Docker deployment on AWS ECS.

Adding more nodes to the data cluster brought all the ES APIs back to life.

Glad to see you got your cluster healthy again. In case you're interested, a cluster health request normally goes only to the master node, whereas requests like node stats fan out to every node in the cluster. If any node is too busy to respond, this kind of request can hang until that node recovers. A good thing to try is GET /_nodes/hot_threads, since this gets a thread dump of every node in the cluster and shows you, at a very low level, what the busy ones are busy doing. If you get into this state again, try that and share the output here.
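
For a toolbox script, the hot threads request could be wrapped roughly like this. It is only a sketch using the Python standard library: the helper names, the http://localhost:9200 address, and the 30-second client timeout are assumptions for illustration, and threads and interval are passed explicitly as query parameters. Because this request also goes to every node, the client-side timeout keeps the script from spinning forever if a node never answers.

```python
import urllib.request
from urllib.parse import urlencode


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """Build the URL for GET /_nodes/hot_threads (hypothetical helper)."""
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url, timeout_s=30.0, **params):
    """Return the plain-text hot threads report.

    Raises an exception (for example on a socket timeout) instead of
    hanging if the cluster does not answer within timeout_s seconds.
    """
    url = hot_threads_url(base_url, **params)
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read().decode("utf-8", errors="replace")


# Example call (assumed address; point it at a node that still answers HTTP):
# print(fetch_hot_threads("http://localhost:9200", threads=5))
```

Saving the returned text to a file makes it easy to paste the output into a thread like this one.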

Thanks David, will definitely add to my toolbox scripts!
