I have an environment with a 3-node cluster, and the _cluster/health endpoint showed 3 nodes and a GREEN status. After one node went down and stayed down for 20 minutes, the two remaining nodes continued to report 3 nodes and a GREEN cluster. Under what circumstances would this be possible? How can I explain this? Details follow...
One node had a hardware failure and was completely down for 20 minutes, but my two Prometheus metrics exporter instances, which communicate with the two remaining nodes, still reported that there were 3 nodes in the cluster and that the cluster health was GREEN.
The two remaining nodes have this type of error in their logs, which indicates that they cannot reach the previously existing node:
[2024-10-15T12:07:18,168][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [<REDACTED>] failed to execute [cluster:monitor/nodes/stats] on node [{<REDACTED>}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{<REDACTED>}{<REDACTED>}{<REDACTED>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [<REDACTED>][<REDACTED>:9300] Node not connected
I also see this type of error from the two remaining nodes. It seems to make sense, because with the master_timeout property the request has to be handled by the master (which I think is the node that went down):
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"},"status":503}
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
...
Suppressed: org.elasticsearch.client.ResponseException: method [PUT], host [http://<REDACTED>:9200], URI [/<REDACTED>/_settings?master_timeout=30s&timeout=30s], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"},"status":503}
Everything indicates that the two remaining nodes should be aware that the cluster no longer has 3 nodes, but that's not what they report.
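To rule out stale data in the exporters, I could query each surviving node directly and compare what it reports. Below is a minimal sketch of that check, not something I ran during the incident. It assumes the nodes accept unauthenticated HTTP on port 9200; the function names are my own, and the only fields read are `status` and `number_of_nodes` from the `GET /_cluster/health` response.

```python
import json
import urllib.error
import urllib.request


def fetch_health(base_url, timeout=5.0):
    """Return the parsed GET /_cluster/health response from one node, or None on failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Unreachable node, HTTP error (e.g. 503), or a body that is not valid JSON.
        return None


def summarize(health):
    """Reduce a health response to (status, number_of_nodes); None if there was no response."""
    if health is None:
        return None
    return (health.get("status"), health.get("number_of_nodes"))


def compare_views(views):
    """Compare health responses keyed by node URL.

    'consistent' is True when every node that answered reported the same
    (status, number_of_nodes) pair; nodes that did not answer are ignored.
    """
    summaries = {url: summarize(health) for url, health in views.items()}
    distinct = {s for s in summaries.values() if s is not None}
    return {"summaries": summaries, "consistent": len(distinct) <= 1}


def check_nodes(base_urls):
    """Query every node in base_urls (hypothetical addresses) and compare their views."""
    return compare_views({url: fetch_health(url) for url in base_urls})
```

Calling `check_nodes(["http://node-a:9200", "http://node-b:9200"])` (placeholder hostnames) while the third node is down would show whether the nodes themselves still return 3 nodes and green, or whether only the exporters do.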
Please help me explain this.