We have an Elasticsearch cluster that has been running for over a year, with 54 data nodes and 3 master nodes. The ES version is 5.2.1.
In the last month, some data nodes have been leaving the cluster with the following error:
failed to ping, tried [3] times, each with maximum [30s] timeout.
master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
<the list of ES nodes; the master that left is in the list>
In the master logs:
Cluster health status changed from [GREEN] to [YELLOW] (reason: [{es-data-18}{10.240.16.128:9300} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2019-01-02T07:02:45,741][INFO ][o.e.c.s.ClusterService ] [es-master-2] removed {{es-data-18}{-}{}{es-prod-data-18}{10.240.16.128:9300},}, reason: zen-disco-node-failed({es-data-18}{}{es-prod-data-18}{10.240.16.128:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)
After that, the cluster continues with the recovery process.
The issue happens on only one node at a time, and each time on a different node.
We don't see any change in CPU/memory usage or network rates before the issue. We have a larger, separate cluster in the same network with no issues, so we don't believe it's a network issue.
Is there anything in this version that may cause this issue? Or are there any other logs we can check in order to figure it out?
It's very likely to be a network issue. es-data-18 tried to send a short "ping" message to es-master-2 and received no response within 30 seconds, and this happened three times in a row. Simultaneously, the master tried to send a short "ping" message to es-data-18 and this also failed three times in a row. Neither node seems to have been unresponsive for this period.
You can see the individual failures with these logger settings:
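For the Zen discovery fault-detection code in 5.x, that would be something like this (a sketch; these logger settings can be applied dynamically through the cluster settings API):

```
PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.discovery.zen.NodesFaultDetection": "TRACE",
    "logger.org.elasticsearch.discovery.zen.MasterFaultDetection": "TRACE"
  }
}
```

Being transient, these settings revert by themselves on a full cluster restart.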
Note that these "ping" messages are sent on long-lived connections which sometimes behave differently from a simple ICMP ping.
How are these clusters hosted? Note that on some cloud providers the connectivity between instances can be affected by many things (e.g. more expensive instances have better connectivity). Be sure you are comparing like-for-like when looking at differences in behaviour between your two clusters.
Thanks. The hosts are managed by us in Google Compute Engine, and the specification is the same for both clusters.
The fact that we already replaced the masters, and the fact that it happens on only one node each time (i.e. all the other nodes succeed in connecting to the master), made us believe it's not a network issue. We also don't see anything in our metrics regarding networking timeouts or issues. In addition, es-master-2 and es-data-18 successfully communicate with other nodes outside the cluster, like our monitoring server.
Is there any ES limit at the master/cluster level that could cause this issue?
Not that I can think of. The pings we're discussing are a pretty basic connectivity/liveness check. They're a few bytes long and don't even do anything as complicated as spawning a separate thread when they're handled.
When you say that you don't detect any problems with your own monitoring, is this monitoring system holding connections open for a long time (hours/days/... comparable with the frequency of this issue occurring)? I've often seen situations where newly-established connections work fine but long-lived connections do not, so make sure you're monitoring the same thing.
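To illustrate the distinction, here is a minimal sketch (Python, Linux-specific socket options, and a hypothetical host/port) that holds a single TCP connection open and relies on kernel keepalives to flag a silent drop, which is exactly the failure mode a monitor that opens a fresh connection per check never exercises:

```python
import socket

# Hold ONE long-lived TCP connection open and detect silent drops via kernel
# TCP keepalives. A monitor that opens a fresh connection for every check can
# report success even while established connections are being broken.
HOST, PORT = "10.240.16.128", 9300  # hypothetical: es-data-18's transport port

sock = socket.create_connection((HOST, PORT))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-only knobs: first probe after 60s idle, then every 30s, fail after 3.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)

try:
    # Block indefinitely; recv() returns b"" on an orderly close, and raises
    # OSError once unanswered keepalive probes reset the connection.
    if sock.recv(1) == b"":
        print("peer closed the connection")
except OSError as exc:
    print(f"long-lived connection broke: {exc}")
```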
How often does this issue occur?
Do you see any log messages which contain the string Received response for a request that has timed out? If you run with the logger settings I gave above, do you see any failures that don't lead to a zen-disco-node-failed event?
Hi David, sorry for the late response, and thanks for your help. We're trying to find a way to set logger.org.elasticsearch.discovery.zen.NodesFaultDetection. Is it dynamic, or should we restart the cluster in order to apply it?
Regarding the other questions:
We don't see the Received response for a request that has timed out message. The only messages we see are those I attached above.
The issue occurs 2-3 times in 24 hours.
The monitoring system is Prometheus, so it's scraping the machines every 45 seconds. So you're right, and it's not a good case to compare to. But as I understand it, it sends a ping, i.e. it's not based on long-lived connections. Am I wrong?
"Ping" is a generic term for this kind of simple health check. The ping command works using ICMP echo-request and echo-reply messages that are indeed connectionless, but that's not what's happening here. The pings that this error message is discussing are transport messages, sent across one of its long-lived node-to-node TCP connections, similar to the IRC message of the same name.