Master node hangs when multiple data nodes are shut down at the same time

When testing the failure of 5 out of 20 nodes, all master nodes became unresponsive. The logs from just before they became unresponsive show many NodeNotConnectedException errors for nodes that were NOT shut down. Why can't a cluster survive multiple node failures?

Thanks

The cluster came back to life in 20 minutes with no data loss. Is there anything that can be done during this down time to?

I think your last comment was cut a little short :)

Yeah, that's what happens at 5am.

I was just thinking there must be a way to get the servers to unblock instead of just waiting for timeouts for 20 minutes.

I do understand what happened, though; I just don't think there should ever be a case where the master servers are completely unresponsive and block all requests for status.

This might be AWS-specific, but here is what's going on.

In AWS, when using security groups, as soon as a server is shut down or goes away on its own, all requests to its IP address are silently dropped by the network stack with no reply to the source, because as far as the network stack is concerned that IP address is no longer part of the security group. It seems that Elasticsearch doesn't handle that scenario very well. If I just shut down the Elasticsearch process, everything is fine: the host is up, the port is closed, and the master gets a connection-refused response right away. If the host itself is down, the master has to wait through multiple timeouts before it considers the node gone, and during that time everything is blocked.
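To make the difference concrete, here's a small Python sketch (the addresses and ports are made up for illustration) showing why the two failure modes feel so different to the caller: a closed port on a live host fails immediately with an RST, while a host whose packets are silently dropped blocks for the full connect timeout.

```python
import socket
import time

def try_connect(host, port, timeout=30.0):
    """Attempt a TCP connection and report how long failure detection took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # Host is up but the port is closed: the kernel answers with an RST,
        # so we learn about the failure almost instantly.
        return f"refused after {time.monotonic() - start:.2f}s"
    except socket.timeout:
        # Packets silently dropped (e.g. the IP fell out of an AWS security
        # group): we sit here for the full timeout before giving up.
        return f"timed out after {time.monotonic() - start:.2f}s"

# Hypothetical addresses, for illustration only.
print(try_connect("10.0.0.5", 9300))   # ES process stopped, host still up -> fast "refused"
print(try_connect("10.0.0.99", 9300))  # Host gone behind a security group -> slow "timed out"
```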

Any thoughts?

Thanks.

Elasticsearch can deal with nodes dropping out; it should be reasonably OK with 25% of them doing so at once. Mind you, that will kick off a fair few cluster state updates and subsequent reallocation of shards.

However, you can tune the timeouts so the cluster responds more quickly to such events.
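For example, assuming a zen-discovery-based release, the fault-detection settings in elasticsearch.yml control how quickly an unreachable node is declared dead (the values below are just a sketch; the documented defaults are 1s / 30s / 3, so check the docs for your version):

```yaml
# Lowering ping_timeout makes the master give up on a silently-dropped node
# sooner, at the cost of more false positives on a flaky network.
discovery.zen.fd.ping_interval: 1s    # how often each node is pinged
discovery.zen.fd.ping_timeout: 10s    # how long to wait for each ping reply
discovery.zen.fd.ping_retries: 3      # failed pings before the node is dropped
```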

It'd be interesting to see what happened in the logs on the master node when you did this.

That's what I would have thought, but there was very little in the logs: other than node X timing out, there was nothing for minutes at a time while the node was inaccessible, then a few more timeouts and an eventual message saying node X was gone. Once all the failed nodes got marked as gone, the cluster started becoming responsive again.