I have a 3-node cluster where Node-1 (the master) has a hardware failure; the other two nodes detect the disconnect but do not hold an election, and no failover occurs.
Timeline:
- 11:44:00 Node-1, the current master, has a hardware failure and goes down. It had been logging "ProcessClusterEventTimeoutException: failed to process cluster event" up until this point; see [1] for examples. Node-2 and Node-3 also have ProcessClusterEventTimeoutException in their logs up to this point, and they are already seeing ReceiveTimeoutTransportException when talking to Node-1.
- 11:57:47 The Node-2 and Node-3 logs start to show "NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected". See [2] for the log.
- 12:59 Node-1 recovers, but Node-2 and Node-3 still do not hold an election.
- 13:01 The Node-2 and Node-3 ES instances are restarted; Node-2 becomes master; cluster health goes from RED -> YELLOW -> GREEN. See [3] for the log.
The Node-2 and Node-3 ES instances only hold an election AFTER they are MANUALLY restarted, and only after we had waited more than 1 hour for them to fail over automatically (which they never did).
(edit) Even though no failover happened, the Prometheus exporter shows that Node-2 and Node-3 were reporting a cluster status of GREEN while Node-1 was down.
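(Not from the incident itself, just a diagnostic idea for next time: the sketch below, using placeholder hostnames and assuming plain HTTP on port 9200 with no TLS/auth, asks each node for its own local view of cluster health and of which node it currently believes is master. That would show whether the GREEN the exporter scraped was just a stale local cluster state.)

```python
# Diagnostic sketch only (assumptions: HTTP on :9200, no TLS/auth, hostnames are placeholders).
# Ask each node for its *local* view of cluster health and of who it thinks the master is.
import json
import urllib.request

NODES = ["node-1", "node-2", "node-3"]

def get_json(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

for host in NODES:
    base = f"http://{host}:9200"
    try:
        # local=true makes the node answer from its own cluster state
        # instead of forwarding the request to the elected master
        health = get_json(f"{base}/_cluster/health?local=true&timeout=5s")
        state = get_json(f"{base}/_cluster/state/master_node?local=true")
        print(f"{host}: status={health.get('status')}, "
              f"nodes={health.get('number_of_nodes')}, "
              f"master_node_id={state.get('master_node')}")
    except Exception as exc:  # node unreachable, auth required, no answer, ...
        print(f"{host}: request failed: {exc}")
```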
Questions:
- Why do Node-2 and Node-3 detect some kind of disconnect/problem from the moment of the Node-1 failure, yet never hold an election or fail over?
- Why does it take 13 minutes (11:44-11:57) for Node-2 and Node-3 to start seeing "NodeNotConnectedException"? (A sketch for checking the relevant fault-detection settings is below, after these questions.)
- Could the ProcessClusterEventTimeoutException events on all nodes be preventing a failover?
- After Node-1 goes down, many queries with no timeout are made to Node-2 and Node-3, and these hang. Is it possible these queries built up and prevented a failover from occurring?
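In case misconfiguration is part of the answer, here is a rough sketch (same assumptions as above: plain HTTP on :9200, no auth, placeholder hostname) for dumping any non-default fault-detection, election, or discovery settings per node. With the documented defaults, the non-master nodes are supposed to declare a dead master failed within roughly 30 seconds and then start an election, which is why the 13-minute / 1-hour delays above look so strange.

```python
# Sketch only (assumes HTTP on :9200, no TLS/auth, placeholder hostname):
# list any explicitly configured fault-detection / election / discovery settings
# on each node. If nothing is printed, the documented defaults apply: non-master
# nodes consider the master faulty after 3 consecutive leader checks time out
# (cluster.fault_detection.leader_check.{interval=1s, timeout=10s, retry_count=3})
# and should then attempt an election within seconds, not minutes.
import json
import urllib.request

URL = "http://node-2:9200/_nodes/settings?flat_settings=true"

with urllib.request.urlopen(URL, timeout=10) as resp:
    info = json.load(resp)

for node_id, node in info["nodes"].items():
    name = node.get("name", node_id)
    for key, value in sorted(node.get("settings", {}).items()):
        if key.startswith(("cluster.fault_detection.", "cluster.election.",
                           "discovery.", "cluster.initial_master_nodes")):
            print(f"{name}: {key} = {value}")
```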
[1]:
[2024-10-15T11:43:00,998][WARN ][rest.suppressed ] [node-1] path: /news/_settings, params: {master_timeout=30s, index=news, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[news-8-0/JXXtmZtqSRm2GKNx1-2Gkw]]) within 30s
at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2024-10-15T11:43:56,870][WARN ][rest.suppressed ] [node-1] path: /designer-objects-ia/_settings, params: {master_timeout=30s, index=designer-objects-ia, timeout=30s}, status: 503
org.elasticsearch.transport.RemoteTransportException: [node-2][<NODE-2_IP>:9300][indices:admin/settings/update]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (update-settings [[designer-objects-ia-8-6/jZvLf-kxTGSigmHZrFQI7A]]) within 30s
at org.elasticsearch.cluster.service.MasterService$TaskTimeoutHandler.doRun(MasterService.java:1460) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.14.1.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
at java.lang.Thread.run(Thread.java:840) ~[?:?]
[2]:
[2024-10-15T11:57:47,914][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [node-2] failed to execute [cluster:monitor/nodes/stats] on node [{node-1}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
...
...
[2024-10-15T11:57:47,917][WARN ][org.elasticsearch.cluster.InternalClusterInfoService] [node-2] failed to retrieve stats for node [UtQcAppJSkO-4BQc4b4avA]
org.elasticsearch.transport.NodeNotConnectedException: [node-1][<NODE-1_IP>:9300] Node not connected
at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:283) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:876) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.getConnectionOrFail(TransportService.java:771) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:757) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:127) ~[elasticsearch-8.14.1.jar:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$1.sendItemRequest(TransportNodesAction.java:95) ~[elasticsearch-8.14.1.jar:?]
[3]:
Log from Node-2, which becomes the new master:
[2024-10-15T13:01:33,271][INFO ][org.elasticsearch.cluster.service.MasterService] [node-2] elected-as-master ([3] nodes joined in term 5)[_FINISH_ELECTION_, {node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election, {node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000} completing election], term: 5, version: 68136, delta: master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}
[2024-10-15T13:01:33,468][INFO ][org.elasticsearch.cluster.service.ClusterApplierService] [node-2] master node changed {previous [], current [{node-2}{tNTw-p6NSnSJxY968nPYZw}{VKtjkdECT12KNNSUWslwVw}{node-2}{node-2}{<NODE-2_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}]}, added {{node-1}{kzk-51K-ThKg7uwQZ9as2g}{_BGkbFdMS5CFULwIZrHBAw}{node-1}{node-1}{<NODE-1_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}, {node-3}{6LpgSROnTNmGmzzlCfofZQ}{JPvvOvJzQW6V51A954agpw}{node-3}{node-3}{<NODE-3_IP>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}}, term: 5, version: 68136, reason: Publication{term=5, version=68136}
...
...
[2024-10-15T13:01:41,829][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="YELLOW" message="Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="RED" reason="shards started [[user-activity-8-0][0]]"
[2024-10-15T13:01:41,864][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [MsnZMCyIRnmjkxcnhEtO-A]
[2024-10-15T13:01:42,186][WARN ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] [user-activity-8-0][0] marking unavailable shards as stale: [9aeL3u_KQYOsqSSdgI6b1g]
[2024-10-15T13:01:44,994][INFO ][org.elasticsearch.cluster.routing.allocation.AllocationService] [node-2] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[user-activity-8-0][0]]])." previous.health="YELLOW" reason="shards started [[user-activity-8-0][0]]"