Half-dead node lead to cluster hang

ywelsch · February 9, 2018, 2:10pm

That's definitely an interesting theory. The stack trace indicates that the amount of "shard failed" events that are handled on the master look to overwhelm it and block network threads. The logs from the production cluster are inconclusive, they show that the master node has trouble reconnecting to the rest of the cluster, but there are also minutes going by without any log information. In case where this happens again in production, can you take stack dumps / heap dumps of the master?

Can you also provide more information on all the steps you took to reproduce this, in particular the ones which led to the 106945 shard-failed events in the test cluster? Thanks.

Topic		Replies	Views
Cluster Hangs for 20 seconds, on a single node crush Elasticsearch	13	895	October 3, 2019
Cluster hanging on node failure Elasticsearch	2	527	July 6, 2017
Long period of querying failure during node timeout Elasticsearch	4	1062	May 15, 2020
Cluster failures Elasticsearch	2	284	July 6, 2017
Master node hangs when multiple data nodes are shutdown at the same time Elasticsearch	6	955	July 6, 2017

Half-dead node lead to cluster hang

Related topics