Half-dead node leads to cluster hang


(jiangguoqiang) #1

Elasticsearch version (bin/elasticsearch --version):
5.6.4

JVM version (java -version):
1.8.0_91

Description of the problem including expected versus actual behavior:
In our production environment, we have encountered hardware failures several times that caused one or more nodes to become half-dead, after which the whole cluster hung.

Elasticsearch cluster:
3 nodes: 24 cores, 128GB memory, 31GB heap

Steps to reproduce:
We used the tc command to simulate the hardware failure and reproduce the problem:

  1. Start the cluster.
  2. Run some heavy indexing (~50% CPU).
  3. Use tc to randomly drop packets:
tc qdisc add dev eth0 root netem loss 50%
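For repeatable experiments, the injection and cleanup can be wrapped in small helpers so the rule is always removed after each run. This is a hedged sketch: the device name (eth0) and 50% loss rate come from the step above, while the DEV/LOSS variables and function names are illustrative; running tc requires root.

```shell
#!/bin/sh
# Hypothetical helpers around the tc command from the steps above.
# DEV and LOSS are illustrative defaults; override via the environment.
DEV=${DEV:-eth0}
LOSS=${LOSS:-50%}

# Start dropping roughly $LOSS of packets on $DEV (requires root).
inject_loss() { tc qdisc add dev "$DEV" root netem loss "$LOSS"; }

# Show whether the netem rule is currently active on $DEV.
show_loss() { tc qdisc show dev "$DEV"; }

# Remove the rule to restore normal networking after the test.
restore() { tc qdisc del dev "$DEV" root netem; }
```

Calling restore after each experiment avoids leaving the test cluster on a degraded network between runs.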

Has anyone encountered a similar problem? Any ideas on how to tolerate such hardware failures?

Thanks : )


(Mark Walkom) #2

What do you mean by "half-dead"? What do the logs show?


(jiangguoqiang) #3

Thanks for your reply.

I mean that some machines or Ethernet switches work abnormally. For example, such hardware can lose 50% of network packets.


(jiangguoqiang) #4

Any reply is appreciated.

Thanks


(Mark Walkom) #5

If you have an unreliable network, I am not sure what you can do?


(jiangguoqiang) #6

In production clusters, this problem usually occurs when hardware fails. I just used the tc command to simulate the hardware failure and reproduce the problem.


(jiangguoqiang) #7

We updated the discovery.zen.fd.ping_timeout setting to 2s, which fixed this problem for data nodes in a 3-node test cluster. In my opinion, this is mainly because the faulty data node is removed from the cluster and its shards are reallocated as soon as possible.

But for the active master node, this change does not work. The active master removes some data nodes, and those data nodes soon rejoin. The active master does not die, and no new master is elected.


(jiangguoqiang) #8

Any reply is appreciated!


(jiangguoqiang) #9

I wonder whether updating discovery.zen.fd.ping_timeout to 2s carries any risk. Any advice for a large-scale cluster?


(Mark Walkom) #10

That is rather low, yes. Perhaps you can post your config and we can check it?


(jiangguoqiang) #11

Our cluster has 100+ nodes; an example config is as follows:

cluster.name: es_xxx
node.data: true
node.ingest: true
node.master: false
node.name: data_node_1
path.data: ./data
processors: 16
indices.memory.index_buffer_size: 15%
node.attr.region: 99
node.attr.set: 25
node.attr.rack: 109699
node.attr.ip: {ip}
cluster.routing.allocation.awareness.attributes: ip
network.host: 0.0.0.0
network.publish_host: {ip}
http.port: 9201
transport.tcp.port: 9301
discovery.zen.ping.unicast.hosts: ["{ip1}:9301","{ip2}:9301","{ip3}:9301","{ip4}:9301","{ip5}:9301"]
discovery.zen.minimum_master_nodes: 3
bootstrap.seccomp: false
discovery.zen.fd.ping_timeout: 2s

(Christian Dahlqvist) #12

Setting the ping timeout that low could cause a lot of problems as any long GC could cause the node to drop out. Sounds a bit risky to me, especially with a cluster that size.

What type of hardware failures are causing these problems? What type of hardware is the cluster deployed on?


(Mark Walkom) #13

How many masters do you have?

Why are you doing this?


(jiangguoqiang) #14
What type of hardware is the cluster deployed on?

It's physical machines with local SSD disks.

What type of hardware failures are causing these problems?

One machine loses its connection to the other nodes or reboots. It's rather easy to reproduce this problem with the tc command in a 3-node test cluster. In my opinion, the bad node isn't removed by the master until the 90s ping timeout expires, during which many bulk requests flood the other nodes and cause old-generation GC.


(jiangguoqiang) #15

5 master-eligible nodes, as in the config above.

We know it's not safe, but we have run into problems with this check.


(jiangguoqiang) #16

That is exactly right. We have noticed some long GC pauses (about 9s) in our production cluster. Setting the ping timeout that low is indeed risky.


(jiangguoqiang) #17

Any reply is appreciated!


(Yannick Welsch) #18

It sounds like the network connection remains half-open (for causes, see e.g. https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html), i.e., the node fault detection on the master does not notice that the connection was closed. It will then take discovery.zen.fd.ping_retries (3) * discovery.zen.fd.ping_timeout (30s) = 90 seconds to notice that the (data) node has become unavailable. Note that this is a rare event and usually indicates a hardware error. If 90 seconds is too long, you can lower those settings, with the risk that long garbage collection cycles can cause your nodes to be dropped by the master. Setting discovery.zen.fd.ping_timeout to 2s might be a bit too extreme, but values in the range of 5-10s (with 3 retries) should be ok.
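As an illustration, settings in that range would go in elasticsearch.yml; the 6s value below is just one point within the suggested 5-10s window, not a recommendation for every cluster:

```yaml
# Illustrative fault-detection settings:
# worst-case detection time ≈ ping_retries × ping_timeout = 3 × 6s = 18s,
# versus 3 × 30s = 90s with the defaults.
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_timeout: 6s
```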

But for the active master node, this change does not work. The active master removes some data nodes, and those data nodes soon rejoin. The active master does not die, and no new master is elected.

Can you provide more information on this? How do these nodes come back? Can you provide logs from the active master node?

We have noticed some long GC pauses (about 9s) in our production cluster. Setting the ping timeout that low is indeed risky.

Have you investigated how to avoid those long GC cycles? What exactly is causing them? Is it due to client requests flooding the data nodes? Have you implemented a client-side backoff strategy?
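As a sketch of what such a client-side backoff could look like, the following wraps an indexing command in a retry loop with exponentially increasing waits, so clients stop flooding a struggling cluster. This is a hypothetical illustration, not code from the cluster discussed here; the function name, retry count, and delays are all assumptions.

```shell
#!/bin/sh
# Hypothetical client-side exponential backoff for bulk requests.
# Retries the given command up to MAX_RETRIES times, doubling the
# wait between attempts.
MAX_RETRIES=${MAX_RETRIES:-5}
BASE_DELAY=${BASE_DELAY:-1}

retry_with_backoff() {
  attempt=1
  delay=$BASE_DELAY
  until "$@"; do
    if [ "$attempt" -ge "$MAX_RETRIES" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (URL and payload are placeholders):
# retry_with_backoff curl -sf -XPOST 'http://localhost:9201/_bulk' \
#   -H 'Content-Type: application/x-ndjson' --data-binary @bulk.ndjson
```

Real clients (e.g. the official bulk helpers) typically also back off on HTTP 429 responses specifically, rather than on any failure.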


(jiangguoqiang) #19

We are testing it with discovery.zen.fd.ping_timeout set to 6s.

When using the tc command to simulate a hardware failure, the master node doesn't really die. The removed data nodes can rejoin through some live data nodes. We will provide logs soon.

We index data heavily, and about half of the heap memory is used (segment memory / bulk / search caches). We haven't implemented a client-side backoff strategy yet.


(jiangguoqiang) #20

I sent the master log via mail. It contains the logs from the live cluster when one data-node machine failed.