Node went down, but the remaining cluster nodes still report GREEN and the same number of nodes

I have an environment with a 3-node cluster, and the cluster health endpoint showed 3 nodes and GREEN. After one node went down and stayed down for 20 minutes, the two remaining nodes continued to report 3 nodes and a GREEN cluster. Under what circumstances would this be possible? How can I explain this? Details follow...

One node had a hardware failure and went completely down for 20 minutes, but my two Prometheus metrics exporter instances, which communicate with the two remaining nodes, still report that there are 3 nodes in the cluster and that the cluster health is GREEN.
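To rule out stale data in the exporters, one option is to ask each remaining node for its cluster health directly and compare the reported node count with the expected one. The following is only a minimal sketch: it assumes the nodes answer unauthenticated HTTP on port 9200, and the helper names `fetch_health` and `summarize_health` are made up for illustration. It reads the `status`, `number_of_nodes`, and `timed_out` fields of the `GET /_cluster/health` response.

```python
import json
import urllib.request


def fetch_health(base_url, timeout=5.0):
    """Call GET <base_url>/_cluster/health and return the parsed JSON body.

    Assumes plain HTTP without authentication; adjust for TLS or credentials.
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health, expected_nodes):
    """Reduce a cluster health response to the fields relevant here."""
    nodes = health.get("number_of_nodes")
    return {
        "status": health.get("status"),
        "number_of_nodes": nodes,
        # How many nodes are missing compared with what we expect to see.
        "missing_nodes": None if nodes is None else expected_nodes - nodes,
        "timed_out": health.get("timed_out"),
    }
```

For example, `summarize_health(fetch_health("http://<node>:9200"), expected_nodes=3)` could be run once against each surviving node (the host is a placeholder). If both nodes answer with `number_of_nodes` equal to 3 while the third is down, the stale view comes from the cluster itself and not from the exporters.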

The two remaining nodes have this type of error in their logs, which indicates that they cannot reach the node that went down:

[2024-10-15T12:07:18,168][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [<REDACTED>] failed to execute [cluster:monitor/nodes/stats] on node [{<REDACTED>}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{<REDACTED>}{<REDACTED>}{<REDACTED>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [<REDACTED>][<REDACTED>:9300] Node not connected

I also see this type of error in responses from the two remaining nodes. That seems to make sense: because the request sets the master_timeout property, it has to be handled by the master, which I think is the node that went down:

{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"},"status":503}
                at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
...
                Suppressed: org.elasticsearch.client.ResponseException: method [PUT], host [http://<REDACTED>:9200], URI [/<REDACTED>/_settings?master_timeout=30s&timeout=30s], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"},"status":503}

All indications say that the two remaining nodes should be aware that the cluster no longer has 3 nodes, but that's not what they report.

Please help me explain this.

Is the process_cluster_event_timeout_exception preventing the nodes from truly going from GREEN to YELLOW?
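One way to investigate this would be to look at the queue of pending cluster state updates on a surviving node. The sketch below is an assumption-laden illustration, not a diagnosis: it assumes unauthenticated HTTP on port 9200, the helper names `fetch_pending_tasks` and `stuck_tasks` are hypothetical, and the request itself may time out if no master is currently elected. It relies on the `tasks` list returned by `GET /_cluster/pending_tasks`, where each entry carries `source` and `time_in_queue_millis`.

```python
import json
import urllib.request


def fetch_pending_tasks(base_url, timeout=5.0):
    """Call GET <base_url>/_cluster/pending_tasks and return the parsed JSON body.

    Assumes plain HTTP without authentication; adjust for TLS or credentials.
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/pending_tasks", timeout=timeout) as resp:
        return json.load(resp)


def stuck_tasks(pending, threshold_ms=30_000):
    """Return (source, time_in_queue_millis) for tasks queued at least threshold_ms."""
    result = []
    for task in pending.get("tasks", []):
        waited = task.get("time_in_queue_millis", 0)
        if waited >= threshold_ms:
            result.append((task.get("source"), waited))
    # Longest-waiting tasks first.
    return sorted(result, key=lambda item: item[1], reverse=True)
```

Calling `stuck_tasks(fetch_pending_tasks("http://<node>:9200"))` with the default threshold of 30 seconds (chosen here to match the `master_timeout=30s` in the failing request) would list updates that have waited at least that long, which might show whether cluster state changes are being held up.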
