Node fails but cluster holds no election and no failover occurs

To be honest, I'm not sure what happened with your cluster.

Looking at the logs you shared, there is no indication that Node-1 left the cluster. You have a lot of logs about timeouts while trying to connect to Node-1 and about Node-1 being disconnected, but there is no log indicating when, or whether, Node-1 actually left the cluster.

For example, I started a 3-node cluster with Docker and killed one of the containers, and both the master and the other remaining node logged the event.

On the master node I would see a node-left log like this:

node-left[{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000} reason: disconnected], term: 1, version: 156, delta: removed {{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}}

The master node would also log another node-left line like this:

node-left: [{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}] with reason [disconnected]

On the other remaining node I would see a removed log like this:

removed {{es02}{3PFydHiiQfqHKQWslFD7RQ}{YlyuLUXWTdWE8VYvdI4RNQ}{es02}{192.168.10.4}{192.168.10.4:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}}, term: 1, version: 156, reason: ApplyCommitRequest{term=1, version=156, sourceNode={es03}{f-QU7JZAR8SOClUw6hoPoQ}{RViNXl-ISiyqO1bSWJjWzQ}{es03}{192.168.10.5}{192.168.10.5:9300}{cdfhilmrstw}{8.15.2}{7000099-8512000}{xpack.installed=true, ml.machine_memory=1073741824, ml.allocated_processors=8, ml.allocated_processors_double=8.0, ml.max_jvm_size=536870912, ml.config_version=12.0.0, transform.config_version=10.0.0}}

But there are no node-left or removed lines in any of the logs you shared.

Another thing: when you say that Node-1 was replaced, you mean the entire machine, including the Elasticsearch installation, was replaced, right? Because there are two different IDs for Node-1 in your logs.

[2024-10-15T12:59:57,480][WARN ][org.elasticsearch.cluster.NodeConnectionsService] [Node-2] failed to connect to {Node-1}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{Node-1}{Node-1}{:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true} (tried [331] times)
org.elasticsearch.transport.ConnectTransportException: [Node-1][:9300] handshake failed. unexpected remote node {Node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{Node-1}{Node-1}{:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}
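If it helps to confirm which instance the cluster currently sees, you can list the node IDs the cluster knows about via the `_nodes` API. This is just a minimal sketch in Python using `requests`; the `http://localhost:9200` address and the lack of authentication are assumptions, so adjust for your setup:

```python
import requests

# Minimal sketch: list the node IDs and names the cluster currently knows about.
# The URL and lack of authentication are assumptions; adjust for your environment.
resp = requests.get("http://localhost:9200/_nodes")
resp.raise_for_status()

for node_id, info in resp.json()["nodes"].items():
    print(node_id, info["name"], info.get("transport_address"))
```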

I would assume that your Node-1 had some issue but the service was still answering some communications, which meant its removal from the cluster was never triggered. I'm not sure whether this is possible, though, or in which scenario it could happen.

Unfortunately I do not know the internals of Elasticsearch well enough to provide further help, and without the full logs from all three nodes I don't think you will be able to find what the issue may be.

If your cluster is running fine right now, I would disable the DEBUG logs (they are not required by default) and retain your logs for longer, in case something like this happens again and you need to troubleshoot.
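For reference, here is a sketch of resetting dynamically configured loggers back to their defaults via the cluster settings API (setting a persistent setting to null removes it). The logger names below are placeholders for whatever packages you raised to DEBUG, and the localhost URL without auth is an assumption:

```python
import requests

# Minimal sketch: reset dynamically configured loggers back to their default level
# by setting them to None (JSON null). The logger names are placeholders for
# whatever packages you raised to DEBUG; the URL and lack of auth are assumptions.
settings = {
    "persistent": {
        "logger.org.elasticsearch.cluster": None,
        "logger.org.elasticsearch.discovery": None,
    }
}

resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())
```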


@leandrojmp and Christian, thank you for taking an earnest look at this. I'm always surprised by how quickly ES folks respond to forum questions even though (afaik) you have no obligation to help.

Another thing is, when you mean that the Node-1 was replaced, the entire machine, including the Elasticsearch installation was replaced, right?

Yes, per support staff, the detected hardware failure prompted them to create a "replacement node to be booted up on fresh hardware".


Yeah, on our end we also do our own failover and disconnect testing, and we usually see logging from the following: ClusterConnectionManager, ClusterApplierService, AllocationService, NodeLeftExecutor, NodeJoinExecutor. We also normally see cluster health degrade, and we always see the number of nodes reported by the cluster health API show an accurate count.
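For what it's worth, a minimal sketch of the kind of check we run against the cluster health API; the endpoint, missing auth, and the expected node count of 3 are assumptions for illustration:

```python
import requests

# Minimal sketch: poll the cluster health API and compare the reported node count
# against the expected cluster size. The URL, lack of auth, and EXPECTED_NODES=3
# are assumptions for illustration.
EXPECTED_NODES = 3

health = requests.get("http://localhost:9200/_cluster/health").json()
print(f"status={health['status']} number_of_nodes={health['number_of_nodes']}")

if health["number_of_nodes"] != EXPECTED_NODES:
    print(f"WARNING: expected {EXPECTED_NODES} nodes, cluster reports {health['number_of_nodes']}")
```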