Master node hangs when multiple data nodes are shut down at the same time

When testing the failure of 5 out of 20 nodes, all master nodes became unresponsive. The logs from just before they became unresponsive show many NodeNotConnectedException errors for nodes that were NOT shut down. Why can't a cluster survive multiple node failures?

Thanks

The cluster came back to life in 20 minutes with no data loss. Is there anything that can be done during this down time to?

I think your last comment was cut a little short :)

Yeah, that's what happens at 5am.

I was just thinking there must be a way to get the servers to unblock instead of just waiting for timeouts for 20 minutes.

I do understand what happened, though; I just don't think there should ever be a case where the master servers are completely unresponsive and block all requests for status.

This might be AWS-specific, but here is what's going on.

In AWS, when using security groups, as soon as a server is shut down or goes away on its own, all requests to its IP address are silently dropped by the network stack with no reply to the source, because as far as the network stack is concerned that IP address is no longer part of the security group. It seems that Elasticsearch doesn't handle that scenario very well. If I just shut down the Elasticsearch process, everything is fine: the host is up, the port is closed, and the master gets a connection-refused response right away. If the host itself is down, the master has to wait through multiple timeouts before it considers the node gone, and during that time everything is blocked.
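To make the difference concrete, here's a small Python sketch (the addresses and ports are made up for illustration) showing why the two failure modes feel so different to the caller: a closed port on a live host fails immediately with an RST, while a host whose packets are silently dropped blocks for the full connect timeout.

```python
import socket
import time

def try_connect(host, port, timeout=30.0):
    """Attempt a TCP connection and report how long failure detection took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # Host is up but the port is closed: the kernel answers with an RST,
        # so we learn about the failure almost instantly.
        return f"refused after {time.monotonic() - start:.2f}s"
    except socket.timeout:
        # Packets silently dropped (e.g. the IP fell out of an AWS security
        # group): we sit here for the full timeout before giving up.
        return f"timed out after {time.monotonic() - start:.2f}s"

# Hypothetical addresses, for illustration only.
print(try_connect("10.0.0.5", 9300))   # ES process stopped, host still up -> fast "refused"
print(try_connect("10.0.0.99", 9300))  # Host gone behind a security group -> slow "timed out"
```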

Any thoughts?

Thanks.

Elasticsearch can deal with nodes dropping out; it should be reasonably OK with 25% of them doing so at once. Mind you, that will kick off a fair few cluster state updates and subsequent reallocation of shards.

However, you can tune the timeouts so the cluster responds more quickly to such events.
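For example, assuming a zen-discovery-based release, the fault-detection settings in elasticsearch.yml control how quickly an unreachable node is declared dead (the values below are just a sketch; the documented defaults are 1s / 30s / 3, so check the docs for your version):

```yaml
# Lowering ping_timeout makes the master give up on a silently-dropped node
# sooner, at the cost of more false positives on a flaky network.
discovery.zen.fd.ping_interval: 1s    # how often each node is pinged
discovery.zen.fd.ping_timeout: 10s    # how long to wait for each ping reply
discovery.zen.fd.ping_retries: 3      # failed pings before the node is dropped
```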

It'd be interesting to see what happened in the logs on the master node when you did this.

That's what I would have thought, but there was very little in the logs: other than node X timing out, there was nothing for minutes at a time while the node was inaccessible, then a few more timeouts and an eventual message saying node X was gone. Once all the failed nodes got marked as gone, the cluster started becoming responsive again.