Node fails but cluster holds no election and no failover occurs

To be honest, I'm not sure what happened with your cluster.

Looking at the logs you shared, there is no indication that Node-1 left the cluster. You have a lot of logs about timeouts while trying to connect to Node-1 and about Node-1 being disconnected, but there is no log indicating when, or whether, Node-1 left the cluster.

For example, I started a 3-node cluster with Docker and killed one of the containers, and both the master and the other node logged it.

On the master node I would see a node-left log like this:

node-left[{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000} reason: disconnected], term: 1, version: 156, delta: removed {{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}}

Also on the master node, another node-left line like this would be logged:

node-left: [{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}] with reason [disconnected]

On the other remaining node I would see a removed log like this:

removed {{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}}, term: 1, version: 156, reason: ApplyCommitRequest{term=1, version=156, sourceNode={es03}{f-QU7JZAR8SOClUw6hoPoQ}{RViNXl-ISiyqO1bSWJjWzQ}{es03}{192.168.10.5}{192.168.10.5:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}{xpack.installed=true, ml.machine_memory=1073741824, ml.allocated_processors=8, ml.allocated_processors_double=8.0, ml.max_jvm_size=536870912, ml.config_version=12.0.0, transform.config_version=10.0.0}}

But there are no node-left or removed lines in any of the logs you shared.

Another thing: when you say that Node-1 was replaced, the entire machine, including the Elasticsearch installation, was replaced, right? I ask because there are 2 different ids for Node-1 in your logs.

[2024-10-15T12:59:57,480][WARN ][org.elasticsearch.cluster.NodeConnectionsService] [Node-2] failed to connect to {Node-1}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{Node-1}{Node-1}{:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true} (tried [331] times)
org.elasticsearch.transport.ConnectTransportException: [Node-1][:9300] handshake failed. unexpected remote node {Node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{Node-1}{Node-1}{:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}
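
As an aside, if you want to confirm which node IDs are currently in the cluster (a rebuilt machine will join with a new ID), the cat nodes API can list them. A minimal sketch with the Python client, where the URL and credentials are placeholders for your own cluster:

from elasticsearch import Elasticsearch

# Placeholder connection details - adjust for your own cluster.
es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "<password>"))

# GET _cat/nodes with explicit columns; full_id=True returns the complete node ID
# so it can be compared against the IDs that appear in the logs.
for node in es.cat.nodes(h="name,id,ip", full_id=True, format="json"):
    print(node["name"], node["id"], node["ip"])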

I would assume that your Node-1 had some issue but the service was still answering some communications, which meant it was never triggered to leave the cluster. However, I'm not sure whether this is possible, or in which scenario it could happen.

Unfortunately I do not know the internals of Elasticsearch well enough to provide further help, and without all the logs from all three nodes, I don't think you will be able to find what the issue may have been.

If your cluster is running OK right now, I would disable the DEBUG logs (they are not required by default) and retain your logs for longer, in case something like this happens again and you need to troubleshoot.
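
In case the DEBUG level was set dynamically through the cluster settings API rather than in log4j2.properties, it can be reset the same way by nulling the logger setting. A minimal sketch with the Python client; the logger name and connection details are assumptions:

from elasticsearch import Elasticsearch

# Placeholder connection details - adjust for your own cluster.
es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "<password>"))

# Setting a logger to None (null) removes the override so it falls back to the default level.
# The logger name is an assumption - use whichever logger was actually raised to DEBUG.
es.cluster.put_settings(persistent={"logger.org.elasticsearch": None})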


@leandrojmp and Christian, thank you for taking an earnest look at this. I'm always surprised by how quick ES folks are to respond to forum questions even though (afaik) you have no obligation to help.

Another thing is, when you mean that the Node-1 was replaced, the entire machine, including the Elasticsearch installation was replaced, right?

Yes, per support staff, the detected hardware failure prompted them to create a "replacement node to be booted up on fresh hardware".


Yeah, on our end we also do our own failover and disconnect testing, and we usually see logging from the following: ClusterConnectionManager, ClusterApplierService, AllocationService, NodeLeftExecutor, NodeJoinExecutor. We also normally see cluster health degrade, and we always see the number of nodes reported by the cluster health API show an accurate count.
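
For anyone else reading, a minimal sketch of that kind of node-count check against the cluster health API could look something like this (Python client; the expected node count and connection details are placeholders, not anything from the cluster in question):

import time

from elasticsearch import Elasticsearch

# Placeholder connection details and node count - adjust for your own cluster.
es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "<password>"))
EXPECTED_NODES = 3

while True:
    health = es.cluster.health()  # GET _cluster/health
    if health["number_of_nodes"] < EXPECTED_NODES or health["status"] == "red":
        print(f"ALERT: {health['number_of_nodes']}/{EXPECTED_NODES} nodes, status={health['status']}")
    time.sleep(30)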

Just to reply here to your request for more help on GitHub: I don't think we can do any more investigation than Leandro and Christian have already done. We'd need full logs from all three nodes covering the entire outage and ideally a little either side too. All the stuff about ProcessClusterEventTimeoutException and so on that you've shared above isn't really relevant. AIUI those logs are not available so we can only guess, and my guess is that the failure of node-1 was not a complete failure, at least not initially, and it kept on reporting itself as a healthy member of the cluster to the other nodes in order to prevent them from removing it from the cluster.

Incomplete hardware failures are really hard to detect from within the system (impossible in general) and ES makes basically no effort to do so. Instead it expects there to be some external mechanism to detect the problem and terminate the node ASAP.
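
To sketch the sort of external mechanism I mean (very rough, and not something Elasticsearch ships - the hostname, threshold, and terminate_node hook are all placeholders for whatever your infrastructure provides):

import time
import urllib.error
import urllib.request


def terminate_node():
    # Hypothetical hook: in practice this would call your cloud provider's API,
    # IPMI, or similar to power the failing machine off or replace it.
    print("node-1 looks unhealthy - terminate/replace it here")


FAILURES_BEFORE_ACTION = 5  # assumption: tune for your environment
failures = 0

while True:
    try:
        # Probe the node from outside; the hostname and plain HTTP are assumptions.
        urllib.request.urlopen("http://node-1:9200/", timeout=5)
        failures = 0
    except urllib.error.HTTPError:
        failures = 0  # the node answered (e.g. with a 401), so its HTTP layer is still up
    except Exception:
        failures += 1
        if failures >= FAILURES_BEFORE_ACTION:
            terminate_node()
            failures = 0
    time.sleep(10)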

It is perhaps worth noting that the logs you shared indicate node-1 wasn't the master at the time anyway - the RemoteTransportException wrapper indicates that the node failing to process the cluster event (i.e. the elected master node) was node-2.

It's perhaps also worth noting that removing a node from the cluster is itself a task for the master, and if the master were struggling to process tasks in a timely fashion for some reason then it would also struggle to remove a node even after it shut down. However there's nothing in the conversation above that might suggest why node-2 was struggling.


Thanks @DavidTurner, I agree with what you're saying, and I'd like to add that in this case queries on the remaining nodes got stuck and never completed. The HA cluster basically became unavailable due to a single node going down - the problem is worse than just the cluster reporting an inaccurate number of available nodes.
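
For next time, the task management API should at least show whether searches are piling up on the remaining nodes; a rough sketch with the Python client, connection details being placeholders:

from elasticsearch import Elasticsearch

# Placeholder connection details - adjust for your own cluster.
es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "<password>"))

# GET _tasks?detailed=true&actions=*search* - search tasks currently running on each node
tasks = es.tasks.list(detailed=True, actions="*search*")
for node_id, node in tasks["nodes"].items():
    for task_id, task in node["tasks"].items():
        print(node_id, task["action"], f"{task['running_time_in_nanos'] / 1e9:.1f}s")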

I also found that there were pending tasks that never completed for up to 24 hours before the actual hardware failure was reported. Perhaps that prevented cluster-state-updating tasks from proceeding, preventing proper detection of the failure? Just a guess. If I'm not mistaken, the tasks API only contains recent tasks, so I'm not sure we'll be able to recover the task history to see what was pending at the time of the failure.
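
For reference, the pending cluster tasks API is the one that shows the queue on the elected master, though as noted it is only a point-in-time view and won't recover the history. A minimal sketch, again with placeholder connection details:

from elasticsearch import Elasticsearch

# Placeholder connection details - adjust for your own cluster.
es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "<password>"))

# GET _cluster/pending_tasks - cluster-state update tasks the elected master has not yet processed
for task in es.cluster.pending_tasks()["tasks"]:
    print(task["insert_order"], task["priority"], task["source"], task["time_in_queue_millis"])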

the failure of node-1 was not a complete failure, at least not initially, and it kept on reporting itself as a healthy member of the cluster to the other nodes in order to prevent them from removing it from the cluster.

The entire EC2 node went down for an hour, and Nodes 2 and 3 even detected the disconnect, so I don't see how Node-1 could've reported itself as healthy.