Node fails but cluster holds no election and no failover occurs

I have a 3-node cluster in which Node-1 had a hardware failure; the other two nodes detect the disconnect but do not hold an election, and no failover occurs.

Timeline:

  • 11:44:00 Node-1 is the master, has a hardware failure, and goes down. It has been encountering "ProcessClusterEventTimeoutException: failed to process cluster event" up until this point; see [1] for examples. Node-2 and Node-3 also have ProcessClusterEventTimeoutException in their logs up to this point, and are already seeing ReceiveTimeoutTransportException when talking to Node-1.
  • 11:57:47 Node-2's and Node-3's logs start to show "NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected". See [2] for the log.
  • 12:59 Node-1 recovers, but Node-2 and Node-3 still do not hold an election.
  • 13:01 The Node-2 and Node-3 ES instances are restarted; Node-2 becomes master; cluster health goes from RED -> YELLOW -> GREEN. See [3] for the log.

The Node-2 and Node-3 ES instances only hold an election AFTER they are MANUALLY restarted, and only after we had waited more than an hour for them to fail over automatically (which they did not).

(edit) Even setting the missing failover aside, the Prometheus exporter shows that Node-2 and Node-3 were reporting a cluster status of GREEN while Node-1 was down.

Questions:

  • How come Node-2 and Node-3 detect some kind of disconnect/problem from the moment of the Node-1 failure but do not hold an election and do not failover?
  • How come it takes 13 minutes (11:44-11:57) for Node-2 and Node-3 to start seeing "NodeNotConnectedException"?
  • Could the ProcessClusterEventTimeoutException events on all nodes be preventing a failover?
  • After Node-1 goes down, many queries without timeouts are made to Node-2 and Node-3, and these hang. Is it possible these queries built up and prevented a failover from occurring?

1:

[2024-10-15T11:43:00,998][WARN ][rest.suppressed          ] [node-1] path: /news/_settings, params: {master_timeout=30s, index=news, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[news-8-0/JXXtmZtqSRm2GKNx1-2Gkw]]) within 30s
        at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2024-10-15T11:43:56,870][WARN ][rest.suppressed          ] [node-1] path: /designer-objects-ia/_settings, params: {master_timeout=30s, index=designer-objects-ia, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[designer-objects-ia-8-6/jZvLf-kxTGSigmHZrFQI7A]]) within 30s
        at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]

2:

[2024-10-15T11:57:47,914][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [node-2] failed to execute [cluster:monitor/nodes/stats] on node [{node-1}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
...
...
[2024-10-15T11:57:47,917][WARN ][org.elasticsearch.cluster.InternalClusterInfoService] [node-2] failed to retrieve stats for node [UtQcAppJSkO-4BQc4b4avA]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
        at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]

3:
Log from Node-2, which becomes the new master:

[2024-10-15T13:01:33,271][INFO ][org.elasticsearch.cluster.service.MasterService] [node-2] elected-as-master ([3] nodes joined in term 5)[_FINISH_ELECTION_, {node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election], term: 5, version: 68136, delta: master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}
[2024-10-15T13:01:33,468][INFO ][org.elasticsearch.cluster.service.ClusterApplierService] [node-2] master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}, term: 5, version: 68136, reason: Publication{term=5, version=68136}
...
...
[2024-10-15T13:01:41,829][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="YELLOW" message="Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="RED" reason="shards started [[user-activity-8-0][0]]"
[2024-10-15T13:01:41,864][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [MsnZMCyIRnmjkxcnhEtO-A]
[2024-10-15T13:01:42,186][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [9aeL3u_KQYOsqSSdgI6b1g]
[2024-10-15T13:01:44,994][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="YELLOW" reason="shards started [[user-activity-8-0][0]]"

cluster.fault_detection.leader_check.interval
cluster.fault_detection.leader_check.timeout
cluster.fault_detection.leader_check.retry_count

have been left untouched, so they should be at their defaults of 1s, 10s, and 3, respectively, per the docs.
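
To double-check the effective values (including defaults), the cluster settings API can be queried with include_defaults; a minimal sketch, assuming the API is reachable on localhost:9200:

$ curl -sS 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep cluster.fault_detection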

If the current master is still present in the cluster, was not the node that failed, and there is a majority of master-eligible nodes still available to support it, there is no need for a new election. The current master node can just continue with its work.

This removed the active master and therefore forced a new election.

Were the shards that were present on node-1 eventually relocated to the available nodes? That is what I would expect to happen during failover.
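
A quick way to check that on one of the remaining nodes (assuming the API is reachable on localhost:9200) would be something like the following, which lists each primary/replica shard, its state, and the node it is allocated to:

$ curl -sS 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason'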

Thank you for the response.

If the current master is still present in the cluster, was not the node that failed, and there is a majority of master-eligible nodes still available to support it, there is no need for a new election. The current master node can just continue with its work.

In this case, Node-1 (edit: it is now unclear whether Node-1 was the master) is also the node that failed. In that situation, there should be an election, right?

And whether or not Node-1 was the master, I would have expected the other nodes to report non-GREEN cluster health. Our exporter shows that they reported GREEN nearly throughout, which is consistent with the logs on all three nodes: they indicate that the cluster health only changed about an hour after the failure.

No. Basically, an election is triggered when the current master node leaves the cluster, and if there are enough master-eligible nodes left, they will vote for the new master node.

Since Node-1 left the cluster and an election was not triggered, it wasn't the current master node.

The cluster status is related to the allocation of the shards in the cluster: if all primaries and replicas are allocated, it will report as GREEN.

Depending on your indices and their primary and replica configuration, it is possible to lose a node and still have a GREEN cluster (after a transition through YELLOW).
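
For example, the replica count and allocation state of each index can be checked with something like this (assuming the usual localhost:9200 endpoint):

$ curl -sS 'localhost:9200/_cat/indices?v&h=index,health,pri,rep,status'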

So, what you described looks like expected behavior.


@leandrojmp both Node-2 and Node-3 were reporting 3 nodes in the cluster via Prometheus metrics during the ~1 hour that Node-1 was down.

Also, during our internal testing we have always seen cluster health turn YELLOW when a node goes down. So although what you say about GREEN only meaning there are no unallocated shards may be true, experience and anecdotal evidence tell me this is unlikely to have been the case here.

What was the hardware failure you had? Was the Elasticsearch service down?

You would need to provide the complete logs from this entire timeline for all 3 of your nodes to troubleshoot this; the ones you shared are not complete.

I do not use Prometheus, so I have no idea how it gets the metrics and am not sure why it was reporting the wrong number of nodes.

Sorry, I forgot I had this information too: support staff captured the cluster health directly (i.e., we can ignore the Prometheus metrics collection here).

This was taken from Node-2 while Node-1 was down.

$  curl -sS localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "my-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 13,
  "active_shards" : 37,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 6,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 82683358,
  "active_shards_percent_as_number" : 100.0
}
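
Incidentally, the output above also shows number_of_pending_tasks at 6 and a task_max_waiting_in_queue_millis of roughly 23 hours, which suggests cluster state tasks had been queuing for a long time. If this happens again, it might be worth capturing the pending task queue as well, e.g.:

$ curl -sS 'localhost:9200/_cluster/pending_tasks?pretty'
$ curl -sS 'localhost:9200/_cat/pending_tasks?v'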

What was the hardware failure you had? Was the Elasticsearch service down?

I am told we do not have insight into the exact hardware failure that occurred on AWS at the time, but Node-1's entire EC2 instance was down (not just the Elasticsearch server instance) and only came back up when the node was 'replaced'.

The "service" was down in the sense that no es queries seem to be getting executed.

You would need to provide the complete logs from this entire timeline for all 3 of your nodes to troubleshoot this; the ones you shared are not complete.

I unfortunately can't provide these due to the sensitive information in them. Nor do I even have all of them, because I'd need permission from my customer to get them.

The logs indicate* that before the failure, Node-2 was the last known master, so it's possible that after Node-1 went down the cluster could remain GREEN. But I think this would also mean all the shards would necessarily have been on Node-2 and Node-3; otherwise there would be shards that were not active.

*We lost one of the node's logs due to them rolling over too quickly at the time, so we're not 100% sure of this statement, but it seems likely.


Ignoring the slim possibility that all shards were on Node-2 and Node-3, I still need an explanation for why the remaining nodes were reporting 3 nodes in the cluster for roughly 1 hour while Node-1 was down.

It depends. If all indices are configured to have replicas, new replicas may have been created and the cluster may have briefly transitioned from YELLOW back to GREEN.

Without all the logs from the period when this happened, I don't think it is possible to say what may have happened.

Are all of the nodes running exactly the same version of Elasticsearch?

Yes, they were all running 8.14.1. It is not possible for our installations to contain different-versioned instances at the same time.

You would need to provide the complete logs from this entire timeline for all 3 of your nodes to troubleshoot this; the ones you shared are not complete.

@leandrojmp If we focus on the question of why Node-2 didn't detect the Node-1 failure and why it didn't begin to report only 2 nodes in the cluster via the cluster health endpoint, then we only need to look at the period from the hardware failure through the minutes that follow. The cluster.fault_detection.follower_check* and cluster.fault_detection.leader_check* settings haven't been modified, so a failure should be detected within roughly 30s (with the defaults, three consecutive failed checks, each timing out after 10s). Here are the complete logs from the period 11:43-12:00. Again, the hardware failure occurred at 11:44. I included Node-1's logs up until 12:58 to show you that there was no other output while it was down.

You may ask why I'm not showing Node-3's logs: a) Node-3's logs rolled over too quickly, so we don't have this time period, and b) if we scope the question to why Node-2 didn't detect the Node-1 failure (again, Node-2's cluster health is reporting 3 nodes in the cluster and GREEN status) during this time, then we don't really even need Node-3's logs.

Node-2's logs show that it knows about Node-1's disconnect for as long as Node-1 is down. As far as I can tell, it should not have been reporting 3 nodes in the cluster.

Again, the remaining question is why Node-2 (and Node-3) were reporting 3 nodes in the cluster while Node-1 was down.
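
For future reference, the cluster membership and elected master as seen by Node-2 could also be captured directly while a node is down; a minimal sketch, assuming the API is reachable on Node-2 at localhost:9200:

$ curl -sS 'localhost:9200/_cat/nodes?v&h=name,ip,master,node.role'
$ curl -sS 'localhost:9200/_cat/master?v'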

Node-1's logs:

[2024-10-15T11:43:00,998][WARN ][rest.suppressed          ] [Node-1] path: /news/_settings, params: {master_timeout=30s, index=news, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [Node-2][<Node-2-IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[news-8-0/JXXtmZtqSRm2GKNx1-2Gkw]]) within 30s
        at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2024-10-15T11:43:56,870][WARN ][rest.suppressed          ] [Node-1] path: /designer-objects-ia/_settings, params: {master_timeout=30s, index=designer-objects-ia, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [Node-2][<Node-2-IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[designer-objects-ia-8-6/jZvLf-kxTGSigmHZrFQI7A]]) within 30s
        at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2024-10-15T12:58:45,811][WARN ][org.elasticsearch.nativeaccess.NativeAccess] [Node-1] Unable to load native provider. Native methods will be disabled.

Node-2's logs can be found here (too long to paste in this thread).

Is the log you shared complete or edited?

I do not see anything related to node-1 leaving the cluster, and also nothing about node-1 joining the cluster back.

Also, there are no log lines about the state of the cluster moving from GREEN to YELLOW or RED.

As mentioned, without the full logs, at least for some of the nodes, it is impossible to troubleshoot this.

There is no information about when the node left the cluster, no information about the status of the cluster, and no information about when the node rejoined the cluster.

You may have hit a weird bug or something like that, but I don't think we have enough information to troubleshoot it. If you can share the entire log, then we may be able to look at it and try to understand what may have happened; without it, I don't think there is anything else to add.

Also, keep in mind that this forum is not official support and there is no SLA.


@leandrojmp

Thank you for helping. Here are the full logs from 11:43 to 13:02. The reason they were not shared earlier is that I wanted you all to focus on the issue you could not refute, which is that the nodes were inaccurately reporting 3 nodes in the cluster (and larger logs are harder to scrub and to prove that they're scrubbed properly). This new window of logs includes the full timeline, as you requested.

filtered2 - Google Drive (I will remove access to this link later)

The only modifications made were to remove node/host names as well as IPs

The logs from node-3 only seem to start at 2024-10-15T12:59:15,485, so the window of interest is not covered.

@Christian_Dahlqvist Yes, as mentioned, Node-3's logs rolled over too quickly so we don't have this time period.

But more importantly, this should not impact Node-2's ability to detect a problem with Node-1.

But were they removed? By default the logs are not removed, just gzipped on the server.

The logs were not removed by the customer or our support staff. There are gzipped files in the log dir, but none of them were recently modified, and none of the non-gzipped files have any log entries from before 13:00 on the day of the problem.

I am requesting an environment that will restore a snapshot of the EC2 node from the day AFTER the problem to see if the logs are available from the snapshot, but it might be a while before I get back to you.

Update: confirmed that even if we were to roll back to the snapshot, we wouldn't have the logs you're looking for. We are using a rolling file appender, and in this case all 10 of our files are flooded with messages from the 12:00-13:00 period, which is why we do not have Node-3's logs from before or around 11:54.

I still think that if the question is "why is the Node-1 disconnect not reflected in Node-2's cluster health status as a missing node?", then we don't need to be so concerned with Node-3's logs.

appender.rolling.type = RollingFile
appender.rolling.name = rolling
appender.rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${logFilePrefix}.log
appender.rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${logFilePrefix}-%i.log
appender.rolling.layout.type = PatternLayout
appender.rolling.layout.pattern = [%d{ISO8601}][%-5p][%-25c] [%node_name]%marker %.-10000m%n
appender.rolling.policies.type = Policies
appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
appender.rolling.policies.size.size = 10MB
appender.rolling.strategy.type = DefaultRolloverStrategy
appender.rolling.strategy.max = 10
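
As a side note, if retaining more history around a future incident matters, the rollover window could be widened; a hedged sketch of the kind of change, adapted from the config above (larger and more rolled files, compressed on roll so old logs survive longer):

appender.rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${logFilePrefix}-%i.log.gz
appender.rolling.policies.size.size = 50MB
appender.rolling.strategy.max = 30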