Half-dead node leads to cluster hang

Elasticsearch version (bin/elasticsearch --version):
5.6.4

JVM version (java -version):
1.8.0_91

Description of the problem including expected versus actual behavior:
In our production environment, we have encountered hardware failures several times which cause one or more nodes to become half-dead, and then the whole cluster hangs.

Elasticsearch cluster:
3 nodes: 24 cores, 128 GB memory, 31 GB heap

Steps to reproduce:
We use the tc command to simulate the hardware failure and reproduce the problem:

  1. Start the cluster.
  2. Run some heavy indexing (~50% CPU).
  3. Use the tc command to randomly drop packets (cleanup command shown below):
tc qdisc add dev eth0 root netem loss 50%
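For reference, the netem rule can be removed again after the test to restore the network (assuming the same interface, eth0, as above):

tc qdisc del dev eth0 root netem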

Has anyone encountered a similar problem? Any ideas on how to tolerate such hardware failures?

Thanks : )

What do you mean by "half-dead"? What do the logs show?

Thanks for your reply.

I mean that some machines or Ethernet switches work abnormally. For example, such hardware can lose 50% of network packets.

Any reply is appreciated.

Thanks

If you have an unreliable network, I am not sure what you can do.

In production clusters, this problem usually occurs when hardware fails. I just use the tc command to simulate the hardware failure and reproduce the problem.

We updated the discovery.zen.fd.ping_timeout setting to 2s, which fixed this problem for data nodes in the 3-node test cluster. In my opinion, this is mainly because the bad data node can be removed from the cluster and its shards reallocated as soon as possible.

But for the active master node, this change does not work. The active master removes some data nodes, and those data nodes soon come back. The active master does not die, and no new master is elected.

Any reply is appreciated!

I wonder whether updating discovery.zen.fd.ping_timeout to 2s carries some risk. Any advice for a large-scale cluster?

That is rather low, yes. Perhaps you can post your config and we can check it?

Our cluster has 100+ nodes; an example config is as follows:

cluster.name: es_xxx
node.data: true
node.ingest: true
node.master: false
node.name: data_node_1
path.data: ./data
processors: 16
indices.memory.index_buffer_size: 15%
node.attr.region: 99
node.attr.set: 25
node.attr.rack: 109699
node.attr.ip: {ip}
cluster.routing.allocation.awareness.attributes: ip
network.host: 0.0.0.0
network.publish_host: {ip}
http.port: 9201
transport.tcp.port: 9301
discovery.zen.ping.unicast.hosts: ["{ip1}:9301","{ip2}:9301","{ip3}:9301","{ip4}:9301","{ip5}:9301"]
discovery.zen.minimum_master_nodes: 3
bootstrap.seccomp: false
discovery.zen.fd.ping_timeout: 2s

Setting the ping timeout that low could cause a lot of problems as any long GC could cause the node to drop out. Sounds a bit risky to me, especially with a cluster that size.

What type of hardware failures are causing these problems? What type of hardware is the cluster deployed on?

How many masters do you have?

Why are you doing this?

What type of hardware is the cluster deployed on?

It's physical machines with local SSD disks.

What type of hardware failures are causing these problems?

One machine loses its connection to the other nodes, or reboots. It's rather easy to reproduce this problem with the tc command in a 3-node test cluster. In my opinion, the bad node isn't removed by the master node until the 90s ping timeout expires, during which many bulk requests flood the other nodes and cause old-generation GC.

5 master nodes, as configured above.

We know it's not safe, but we have some problems with this check.

This is exactly correct. We have noticed some long GC pauses (about 9s) in our production cluster. Setting the ping timeout that low is really risky.

Any reply is appreciated!

It sounds like the network connection remains half-open (for causes, see e.g. Detection of Half-Open (Dropped) Connections), i.e., the node fault detection on the master does not notice that the connection was closed. It will then take discovery.zen.fd.ping_retries (3) * discovery.zen.fd.ping_timeout (30s) = 90 seconds to notice that the (data) node has become unavailable. Note that this is a rare event and usually indicates a hardware error. If 90 seconds is too long, you can lower those settings, with the risk that long garbage collection cycles can cause your nodes to be dropped by the master. Setting discovery.zen.fd.ping_timeout to 2s might be a bit too extreme, but values in the range of 5-10s (with 3 retries) should be OK.
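As a rough sketch, a setting in that range would look like this in elasticsearch.yml (10s is just an illustrative value from the suggested 5-10s range, with the retry count left at its default of 3):

discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 3

With these values the worst-case detection time drops to roughly 3 * 10s = 30 seconds instead of 90.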

But for the active master node, this change does not work. The active master removes some data nodes, and those data nodes soon come back. The active master does not die, and no new master is elected.

Can you provide more information on this? How do these nodes come back? Can you provide logs from the active master node?

We have noticed some long GC pauses (about 9s) in our production cluster. Setting the ping timeout that low is really risky.

Have you investigated how to avoid those long GC cycles? What exactly is causing them? Is it due to client requests flooding the data nodes? Have you implemented a client-side backoff strategy?


We are testing it with discovery.zen.fd.ping_timeout set to 6s.
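For reference, that would presumably be the following single line in elasticsearch.yml, which with the default of 3 retries gives a worst-case detection time of about 3 * 6s = 18 seconds:

discovery.zen.fd.ping_timeout: 6s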

When using the tc command to simulate hardware failure, the master node doesn't really die. The removed data nodes can come back through some of the alive data nodes. We will provide logs soon.

We index data heavily, and about half of the heap memory is used (segment memory / bulk / search caches). We haven't implemented a client-side backoff strategy yet.

I sent the master log via mail. It contains the online logs from when one data-node machine failed.