Node went down but other cluster nodes reporting green and no decrease in # nodes

buitcj · October 16, 2024, 10:07pm

I have an environment with a 3-node cluster, and the _cluster/health endpoint showed 3 nodes and GREEN. After one node went down and stayed down for 20 minutes, the two remaining nodes continued to report 3 nodes and a GREEN cluster. Under what circumstances is this possible, and how can I explain it? Details follow...

One node had a hardware failure and went completely down for 20 minutes, but my two Prometheus metrics exporter instances, which communicate with the two remaining nodes, still report that there are 3 nodes in the cluster and that the cluster health is GREEN.
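One way to rule out the exporters as the source of the stale values is to query each remaining node directly and compare the answers. The following is a minimal sketch, not a definitive check: the host URLs, the function names, and the `expected_nodes` value are hypothetical placeholders, and it relies only on the `status` and `number_of_nodes` fields of the `GET /_cluster/health` response.

```python
import json
from urllib.request import urlopen

# Hypothetical addresses of the two surviving nodes; replace with real ones.
NODE_URLS = ["http://node-a:9200", "http://node-b:9200"]


def fetch_health(base_url, timeout=10):
    """Return the parsed GET /_cluster/health response from one node."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(health_by_node, expected_nodes):
    """List every node whose reported node count differs from the expected one."""
    findings = []
    for node, health in health_by_node.items():
        reported = health["number_of_nodes"]
        status = health["status"]
        if reported != expected_nodes:
            findings.append(
                f"{node}: reports {reported} nodes ({status}), expected {expected_nodes}"
            )
    return findings


def main():
    """Query every node in NODE_URLS and print any mismatch (needs live nodes)."""
    health = {url: fetch_health(url) for url in NODE_URLS}
    for line in summarize_health(health, expected_nodes=2):
        print(line)
```

Calling `main()` while one node is down should print a line for each surviving node that still reports 3 nodes; if nothing is printed, the nodes themselves report the expected count and the stale value is more likely coming from the exporters.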

The two remaining nodes have this type of error in their logs, which indicates that they cannot reach the previously existing node:

[2024-10-15T12:07:18,168][DEBUG][org.elasticsearch.action.support.nodes.TransportNodesAction] [<REDACTED>] failed to execute [cluster:monitor/nodes/stats] on node [{<REDACTED>}{UtQcAppJSkO-4BQc4b4avA}{xAjt4736RuaDjmi926JniA}{<REDACTED>}{<REDACTED>}{<REDACTED>:9300}{cdfhilmrstw}{8.14.1}{7000099-8505000}{xpack.installed=true}]
org.elasticsearch.transport.NodeNotConnectedException: [<REDACTED>][<REDACTED>:9300] Node not connected

I also see this type of error from the two remaining nodes. That seems to make sense: because the request sets the master_timeout property, it has to be handled by the master, which I think is the node that went down:

{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"},"status":503}
                at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
...
                Suppressed: org.elasticsearch.client.ResponseException: method [PUT], host [http://<REDACTED>:9200], URI [/<REDACTED>/_settings?master_timeout=30s&timeout=30s], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (update-settings [[<REDACTED>/M61enBc0Q2afJR2etjvFgg]]) within 30s"},"status":503}
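The 503 bodies above carry the exception type in `error.type`, so a client can tell this particular failure apart from other 503 responses. A minimal sketch, assuming the body has the JSON shape shown in the log; the function name is hypothetical.

```python
import json

TIMEOUT_TYPE = "process_cluster_event_timeout_exception"


def is_cluster_event_timeout(body):
    """Return True if an error body reports a cluster event timeout."""
    try:
        error = json.loads(body).get("error", {})
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False
    if not isinstance(error, dict):
        # Some error responses carry a plain string instead of an object.
        return False
    return error.get("type") == TIMEOUT_TYPE
```

This only classifies the response. It does not by itself explain why the health endpoint still shows 3 nodes, but it makes it easy to count how often settings updates timed out while the node was down.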

Everything indicates that the two remaining nodes should know the cluster no longer has 3 nodes, but that is not what they report.

Please help me explain this.

buitcj · October 17, 2024, 9:03pm

Is the process_cluster_event_timeout_exception preventing the nodes from truly going from GREEN to YELLOW?

