Half-dead node leads to cluster hang

By the way, we see some phenomena similar to https://github.com/elastic/elasticsearch/issues/27194:

  1. Though we only have 31 shards in the test cluster, we observe a lot of shard-failed tasks:
    225 add_listener
    1 cluster_reroute(async_shard_fetch)
    1 delayed_allocation_reroute
    6 master
    106945 shard-failed
    1 zen-disco-node-failed
    21 zen-disco-node-join
  2. Killing a node normally does not reproduce the problem: recovery in that scenario is normal, with a reasonable number of shard-failed tasks and quick recovery.

I did not receive those logs. Please send them to my elastic.co e-mail address (yannick@...).

Sorry, I sent it to the wrong e-mail address. I have sent it again.

@ywelsch
We use a plugin to reject bulk requests when JVM old generation usage is high. With it, the test cluster no longer hangs, but we observed an interesting problem, which may be the real issue: 5 minutes after running the above tc command, many transport_server_worker threads on the master are blocked in the ShardFailedTransportHandler.messageReceived() method, and a call to the master's transport address sometimes takes 10+ seconds.

We suspect that the master of the production cluster, whose logs were provided earlier, was also blocked by this problem, and that this is why a new master was elected.

The full jstack:

That's definitely an interesting theory. The stack trace indicates that the volume of "shard failed" events handled on the master appears to overwhelm it and block network threads. The logs from the production cluster are inconclusive: they show that the master node has trouble reconnecting to the rest of the cluster, but there are also minutes going by without any log information. In case this happens again in production, can you take stack dumps / heap dumps of the master?

Can you also provide more information on all the steps you took to reproduce this, in particular the ones which led to the 106945 shard-failed events in the test cluster? Thanks.

We will share the heap dump of the master next time.

Our new test cluster environment:
Elasticsearch cluster:
3 machines, each with 4 data nodes: 12 cores × 4, 64 GB memory × 4
A separate machine is used as the dedicated master node.
Allocation awareness: cluster.routing.allocation.awareness.attributes: ip

Elasticsearch plugin:
Reject bulk requests and shard bulk requests when JVM old generation usage is larger than 85%.
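For illustration, the check is roughly like the following sketch (not the actual plugin code; the pool-name matching and threshold handling are assumptions), using only the standard java.lang.management API:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;
    import java.lang.management.MemoryUsage;

    public final class OldGenCheck {
        // Returns true when the old-generation heap pool is above the given threshold (e.g. 85.0).
        static boolean oldGenAbove(double thresholdPercent) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                String name = pool.getName().toLowerCase();
                if (pool.getType() == MemoryType.HEAP && (name.contains("old") || name.contains("tenured"))) {
                    MemoryUsage usage = pool.getUsage();
                    if (usage.getMax() > 0) {
                        return 100.0 * usage.getUsed() / usage.getMax() > thresholdPercent;
                    }
                }
            }
            return false; // no old-gen pool found or max size unknown
        }

        public static void main(String[] args) {
            // A bulk handler would reject the request when this returns true.
            System.out.println("reject bulk: " + oldGenAbove(85.0));
        }
    }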

Steps to reproduce:
We use the tc command to simulate a hardware failure and reproduce the problem:

  • Start the cluster
  • Create an index with 12 shards, each with 1 replica
  • Run heavy indexing on the data nodes: multiple concurrent bulk requests against all data nodes continuously, each with 5k documents; the bulks consume about 50% of the CPU on every data machine (see the sketch after this list).
  • Use the tc command to randomly drop packets on one data machine:
tc qdisc add dev eth0 root netem loss 50%
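For reference, the indexing load is roughly like the sketch below. This is only an illustration against the plain _bulk HTTP endpoint, not our actual tooling; the host and index names are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Several threads continuously send _bulk requests of 5k small documents each.
    public class BulkLoad {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            for (int t = 0; t < 8; t++) {
                new Thread(() -> {
                    while (true) {
                        StringBuilder body = new StringBuilder();
                        for (int i = 0; i < 5000; i++) {
                            body.append("{\"index\":{\"_index\":\"test\",\"_type\":\"doc\"}}\n");
                            body.append("{\"field\":\"value\"}\n");
                        }
                        HttpRequest req = HttpRequest.newBuilder(URI.create("http://data-node-1:9200/_bulk"))
                                .header("Content-Type", "application/x-ndjson")
                                .POST(HttpRequest.BodyPublishers.ofString(body.toString()))
                                .build();
                        try {
                            client.send(req, HttpResponse.BodyHandlers.discarding());
                        } catch (Exception e) {
                            // ignore failures and keep the load running
                        }
                    }
                }).start();
            }
        }
    }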

It seems the code does not deduplicate shard-failed tasks. It would be nice if you could provide a patch that deduplicates those tasks, so that we can check whether the theory is right.

@ywelsch We found that duplicate checking in TaskBatcher.java is based on task identity, which leads to a huge number of duplicated shard-failed tasks in _cat/pending_tasks. After changing the IdentityHashMap to a ConcurrentHashMap and overriding the equals method of ShardEntry, the number of shard-failed tasks drops sharply, and the test cluster can now recover to green.
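A simplified sketch of the idea (not our actual patch; FailedShardKey is a hypothetical stand-in for ShardEntry): with value-based equals/hashCode, repeated failure events for the same shard collapse into a single pending entry instead of piling up under identity hashing.

    import java.util.Map;
    import java.util.Objects;
    import java.util.concurrent.ConcurrentHashMap;

    final class FailedShardKey {
        final String shardId;
        final String allocationId;

        FailedShardKey(String shardId, String allocationId) {
            this.shardId = shardId;
            this.allocationId = allocationId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof FailedShardKey)) return false;
            FailedShardKey other = (FailedShardKey) o;
            return shardId.equals(other.shardId) && allocationId.equals(other.allocationId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(shardId, allocationId);
        }

        public static void main(String[] args) {
            Map<FailedShardKey, String> pending = new ConcurrentHashMap<>();
            pending.putIfAbsent(new FailedShardKey("[test][3]", "alloc-1"), "shard-failed");
            pending.putIfAbsent(new FailedShardKey("[test][3]", "alloc-1"), "shard-failed"); // duplicate collapses
            System.out.println(pending.size()); // prints 1
        }
    }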

Is checking for duplicates based on task identity required for some cluster state update tasks, or is it just a bug? :smile:

Task identity is currently used to establish the relation between requests and responses (for each request there's exactly one response). Deduplicating the shard-failed events can and should be done on the sender side (ShardStateAction), which has been on my TODO list already. This will not only reduce the load on the master, but also reduce the number of messages that have to flow between data and master nodes. I also want to check whether there's anything that can be done to make TaskBatcher more efficient when dealing with a large number of tasks. I'll be off for 2 days now, but will open an issue and start working on a fix by the end of the week.
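Roughly, the sender-side idea is something like the sketch below (illustrative only, not the actual ShardStateAction change; the names are made up):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Only the first "shard failed" notification for a given shard key is sent to the master;
    // identical notifications arriving while it is in flight are not re-sent. In a full
    // implementation the suppressed callers would share the in-flight request's listener.
    final class ShardFailedSender {
        private final Set<String> inFlight = ConcurrentHashMap.newKeySet();

        void sendShardFailed(String shardKey, Runnable send) {
            if (inFlight.add(shardKey)) {
                send.run(); // first failure for this shard: notify the master
            }
            // else: an equivalent notification is already in flight, skip the duplicate
        }

        void onResponse(String shardKey) {
            inFlight.remove(shardKey); // allow future failures of this shard to be reported again
        }
    }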

Thanks very much. Please send us the issue link after you open it, and we will test once a fix is available.

There's already an open issue that covers this:

We'll start working on a fix.

Thanks. We have subscribed to the issue and will wait for the patch.
