Elasticsearch version (bin/elasticsearch --version):
5.6.4
JVM version (java -version):
1.8.0_91
Description of the problem including expected versus actual behavior:
In our production environment, we have encountered hardware failures several times, which cause one or more nodes to become half-dead, and then the whole cluster hangs.
In production clusters, this problem usually occurs when hardware fails. I used the tc command to simulate the hardware failure and reproduce the problem.
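For reference, a minimal sketch of the kind of simulation we used (the interface name eth0 is an assumption; adjust it for your environment):

```sh
# On the "bad" node: silently drop all packets on eth0 (assumed interface name).
# Because no FIN/RST is ever sent, the other nodes keep their TCP connections half-open.
sudo tc qdisc add dev eth0 root netem loss 100%

# Delete the rule to bring the node back.
sudo tc qdisc del dev eth0 root
```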
We updated the discovery.zen.fd.ping_timeout setting to 2s, which fixed this problem for the data nodes in a 3-node test cluster. In my opinion, this is mainly because we can remove the data node from the cluster and reallocate its shards as soon as possible.
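The change itself is a single setting in elasticsearch.yml on every node (sketch only; as discussed below, 2s turned out to be too aggressive):

```yaml
# elasticsearch.yml - lower the fault detection ping timeout from the 30s default
discovery.zen.fd.ping_timeout: 2s
```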
But for the active master node, this change does not work. The active master removes some data nodes, and those data nodes come back soon afterwards. The active master does not die, and no new master is elected.
Setting the ping timeout that low could cause a lot of problems as any long GC could cause the node to drop out. Sounds a bit risky to me, especially with a cluster that size.
What type of hardware failures are causing these problems? What type of hardware is the cluster deployed on?
What type of hardware failures are causing these problems?
One machine lost its connection to the other nodes, or rebooted. It's rather easy to reproduce this problem with the tc command in a 3-node test cluster. In my opinion, the bad node isn't removed by the master node until the 90s ping timeout expires, during which many bulk requests flood the other nodes and cause old-generation GC.
This is exactly correct. We have noticed some long GCs (about 9s) in our production cluster. Setting the ping timeout that low is indeed risky.
It sounds like the network connection remains half-open (for causes, see e.g. Detection of Half-Open (Dropped) Connections), i.e., the node fault detection on the master does not notice that the connection was closed. It will then take discovery.zen.fd.ping_retries (3) * discovery.zen.fd.ping_timeout (30s) = 90 seconds to notice that the (data) node has become unavailable. Note that this is a rare event and usually indicates a hardware error. If 90 seconds is too long, you can lower those settings, with the risk that long garbage collection cycles can cause your nodes to be dropped by the master. Setting discovery.zen.fd.ping_timeout to 2s might be a bit too extreme, but values in the range of 5-10s (with 3 retries) should be ok.
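A less extreme configuration than 2s could look like this (illustrative values only, not a one-size-fits-all recommendation):

```yaml
# elasticsearch.yml - 3 retries * 6s = 18s to detect an unresponsive node,
# while still tolerating moderately long GC pauses.
discovery.zen.fd.ping_timeout: 6s
discovery.zen.fd.ping_retries: 3
```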
But for the active master node, this change does not work. The active master removes some data nodes, and those data nodes come back soon afterwards. The active master does not die, and no new master is elected.
Can you provide more information on this? How do these nodes come back? Can you provide logs from the active master node?
We have noticed some long GCs (about 9s) in our production cluster. Setting the ping timeout that low is indeed risky.
Have you investigated how to avoid those long GC cycles? What exactly is causing them? Is it due to client requests flooding the data nodes? Have you implemented a client-side backoff strategy?
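As a rough sketch only (names and values are assumptions, not a prescription): if you index through the Java transport client, the BulkProcessor can retry rejected bulk requests with an exponential backoff and limit concurrent bulks, e.g.:

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class BackoffBulkIndexer {

    // Sketch: wrap an existing client in a BulkProcessor that retries bulks rejected
    // by a busy cluster with exponential backoff instead of flooding the data nodes.
    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long executionId, BulkRequest request) {}

                @Override
                public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {}

                @Override
                public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                    // Log and/or re-queue the failed bulk here.
                }
            })
            .setBulkActions(1000)             // flush every 1000 index requests
            .setConcurrentRequests(1)         // at most one bulk in flight at a time
            // retry up to 3 times, starting at 100ms and doubling each time
            .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 3))
            .build();
    }
}
```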
We are testing it with discovery.zen.fd.ping_timeout set to 6s.
When using the tc command to simulate a hardware failure, the master node doesn't really die. The removed data nodes can come back through some of the alive data nodes. We will provide logs soon.
We index data heavily, and about half of the heap memory is used (segment memory / bulk / search cache). We haven't implemented a client-side backoff strategy yet.