Half-dead node leads to cluster hang

That's definitely an interesting theory. The stack trace indicates that the number of "shard failed" events handled on the master appears to overwhelm it and block network threads. The logs from the production cluster are inconclusive: they show that the master node has trouble reconnecting to the rest of the cluster, but there are also stretches of several minutes with no log output at all. If this happens again in production, can you take stack dumps and heap dumps of the master?
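
In case it helps, here is a minimal sketch of one way to capture both dumps in one go. It assumes the standard JDK tools `jstack` and `jmap` are on the PATH of the master host and that you run it as the same user as the Elasticsearch process; the function names, file names and the PID in the usage note are only illustrative.

```python
import subprocess
import time


def dump_commands(pid, out_dir="."):
    """Build the jstack/jmap command lines for the given JVM process id.

    The output file names are hypothetical; adjust them to your setup.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    stack_file = f"{out_dir}/master-stack-{stamp}.txt"
    heap_file = f"{out_dir}/master-heap-{stamp}.hprof"
    return {
        # jstack -l prints thread stacks plus lock information to stdout
        "stack": (["jstack", "-l", str(pid)], stack_file),
        # jmap writes a binary heap dump directly to the given file
        "heap": (["jmap", f"-dump:format=b,file={heap_file}", str(pid)], None),
    }


def take_dumps(pid, out_dir="."):
    """Run both commands; raises CalledProcessError if either tool fails."""
    cmds = dump_commands(pid, out_dir)
    stack_cmd, stack_file = cmds["stack"]
    with open(stack_file, "w") as fh:
        subprocess.run(stack_cmd, stdout=fh, check=True)
    heap_cmd, _ = cmds["heap"]
    subprocess.run(heap_cmd, check=True)
```

Calling `take_dumps(12345)` (with the real PID of the master's JVM in place of 12345) a few times, some seconds apart, should make it easier to see which threads stay blocked. Note that a heap dump pauses the JVM while it is written, so on an already struggling master it may be safer to take the stack dumps first.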

Can you also provide more information on the steps you took to reproduce this, in particular the ones that led to the 106945 shard-failed events in the test cluster? Thanks.