Unable to form cluster after half of the cluster nodes were removed - ES 7.1

Hi,

We had six nodes (all master-eligible) running in our cluster, and then three nodes were removed, volumes gone. The three nodes that were removed had been excluded from shard allocation with the setting:

"cluster.routing.allocation.exclude._ip"

and each held only one shard, belonging to the _security index.
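
For context, that kind of exclusion is normally applied through the cluster settings API; a sketch of the call (the IPs are placeholders for the removed nodes' addresses, which aren't listed in this thread):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "<removed-node-ip-1>,<removed-node-ip-2>,<removed-node-ip-3>"
  }
}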

After the nodes were removed, this failure arose (from the log):

elasticsearch security index is unavailable short circuiting retrieval of user

Settings for cluster discovery were as follows (legacy from v6):

discovery.zen.ping.unicast.hosts: "192.168.50.80:9300, 192.168.50.81:9300, 192.168.50.83:9300"
discovery.zen.minimum_master_nodes: 2
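
(For reference, in 7.x these zen settings are superseded by discovery.seed_hosts, together with cluster.initial_master_nodes when bootstrapping a brand-new cluster; the 7.x equivalent of the hosts list above would be something like:)

discovery.seed_hosts: ["192.168.50.80:9300", "192.168.50.81:9300", "192.168.50.83:9300"]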

My issue now is that the cluster won't elect a master since the requirements aren't fulfilled. From the log on node es-1:

[<time..>][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-1] master not discovered or elected yet, an election requires at least 3 nodes with ids from [2IE4RVpNTfKpL5JQsbvPCQ, 4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, DuIW_Q65QFeZWOzwcyKlXA, TxpUfaPDTrSCngTNuZ_Brg] and at least 2 nodes with ids from [4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, TxpUfaPDTrSCngTNuZ_Brg], have discovered [{es-2}{TxpUfaPDTrSCngTNuZ_Brg}{X0wUWTMrTLm1tDE2JMHZ2A}{192.168.50.81}{192.168.50.81:9300}{xpack.installed=true}, {es-4}{8DDEaYnnQKGkK9ea2klnnw}{EtBJ1BycTBCTIANy3Pp_MA}{192.168.50.83}{192.168.50.83:9300}{xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.50.81:9300, 192.168.50.83:9300] from hosts providers and [{es-1}{Vmau018eQWO3AjMzSuo8sQ}{3Ri3q7cvQEGV-p11ykEkCA}{192.168.50.80}{192.168.50.80:9300}{xpack.installed=true}] from last-known cluster state; node term 1087, last-accepted version 125480 in term 11

Same for node es-2:

[<time...>][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-2] master not discovered or elected yet, an election requires at least 3 nodes with ids from [2IE4RVpNTfKpL5JQsbvPCQ, 4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, DuIW_Q65QFeZWOzwcyKlXA, TxpUfaPDTrSCngTNuZ_Brg] and at least 2 nodes with ids from [4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, TxpUfaPDTrSCngTNuZ_Brg], have discovered [{es-1}{Vmau018eQWO3AjMzSuo8sQ}{3Ri3q7cvQEGV-p11ykEkCA}{192.168.50.80}{192.168.50.80:9300}{xpack.installed=true}, {es-4}{8DDEaYnnQKGkK9ea2klnnw}{EtBJ1BycTBCTIANy3Pp_MA}{192.168.50.83}{192.168.50.83:9300}{xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.50.80:9300, 192.168.50.83:9300] from hosts providers and [{es-2}{TxpUfaPDTrSCngTNuZ_Brg}{X0wUWTMrTLm1tDE2JMHZ2A}{192.168.50.81}{192.168.50.81:9300}{xpack.installed=true}] from last-known cluster state; node term 1087, last-accepted version 125480 in term 11

and es-4:

[<time..>][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-4] master not discovered or elected yet, an election requires at least 3 nodes with ids from [2IE4RVpNTfKpL5JQsbvPCQ, 4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, DuIW_Q65QFeZWOzwcyKlXA, TxpUfaPDTrSCngTNuZ_Brg] and at least 2 nodes with ids from [4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, TxpUfaPDTrSCngTNuZ_Brg], have discovered [{es-1}{Vmau018eQWO3AjMzSuo8sQ}{3Ri3q7cvQEGV-p11ykEkCA}{192.168.50.80}{192.168.50.80:9300}{xpack.installed=true}, {es-2}{TxpUfaPDTrSCngTNuZ_Brg}{X0wUWTMrTLm1tDE2JMHZ2A}{192.168.50.81}{192.168.50.81:9300}{xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.50.80:9300, 192.168.50.81:9300] from hosts providers and [{es-4}{8DDEaYnnQKGkK9ea2klnnw}{EtBJ1BycTBCTIANy3Pp_MA}{192.168.50.83}{192.168.50.83:9300}{xpack.installed=true}] from last-known cluster state; node term 1087, last-accepted version 125480 in term 11

I don't understand why the election requires three nodes. Is the version 6 minimum_master_nodes setting not used at all?
If so, was the requirement of three nodes also enforced when the cluster had six nodes?

I have tried adding the setting

cluster.initial_master_nodes: [es-1, es-2]

But that doesn't make a difference.

Is it possible to reset cluster formation settings/requirements?

It seems like it is the ID for node es-1,

Vmau018eQWO3AjMzSuo8sQ

that doesn't correspond to any of the required IDs, which is why the requirement of 3 nodes fails.

How did you remove the 3 nodes from your cluster?
Are you sure you changed the settings on all your nodes, or did you kill them all before updating the settings?

You're providing too many settings. cluster.initial_master_nodes is a setting from version 7.x of Elasticsearch. It seems like you're using version 6.x.

The instances for the 3 nodes were removed/wiped.
The setting "cluster.routing.allocation.exclude._ip" included all of the IPs corresponding to the 3 nodes that were wiped. All shards except the _security shard had been removed from them.
The cluster is version 7.1, upgraded from version 6, which is why some old settings have been in use.
I will remove the cluster.initial_master_nodes setting; it was only added to try and force the cluster to settle with the available nodes.

It (minimum_master_nodes) is used briefly during a rolling upgrade, but otherwise it's ignored, yes.

The docs on removing master nodes say:

... if you shut down half or more of the master-eligible nodes all at the same time then the cluster will normally become unavailable. [...]
As long as there are at least three master-eligible nodes in the cluster, as a general rule it is best to remove nodes one-at-a-time ...

This is because it is possible that none of the remaining nodes holds the latest cluster state. The only safe path forward is to bring one or more of the missing nodes back online. The next best thing is to restore the cluster from a snapshot.
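
For anyone finding this thread later: the supported 7.x procedure for taking half or more of the master-eligible nodes out of service is to add them to the voting configuration exclusions first and wait for the cluster to auto-reconfigure before shutting them down. A sketch, with a placeholder node name and the API form used by early 7.x releases:

POST /_cluster/voting_config_exclusions/<node-name-being-removed>

followed, once the removed nodes are gone for good, by:

DELETE /_cluster/voting_config_exclusions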

Right, so that condition was fulfilled since half of the master-eligible nodes were wiped.

What are the unsafe paths? I'm thinking more in terms of resetting the cluster formation settings/requirements, for example changing the required node IDs to include the ID for node es-1.

Restoring from a snapshot is unsafe if you've added data since your last snapshot, because unfortunately that data won't be restored. Hopefully it's possible to ingest it again?
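
If the snapshot route were viable, a restore that also brings back the cluster metadata would look roughly like this; the repository and snapshot names are placeholders:

POST /_snapshot/<repository>/<snapshot>/_restore
{
  "include_global_state": true
}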

I'm afraid too much data has been added since the latest snapshot was taken. I'm very interested in forcing the nodes to accept each other.

I know that the wiped nodes "only" contained shards from the _security index, so all other data indices should be intact.

Is there any possibility to achieve this?

That might be true, but the trouble is that the metadata that keeps these indices readable and consistent and so on is kept on the master nodes, and you may not have a valid copy of this metadata any more.

It's possible that the elasticsearch-node tool can do what you need, although please read the instructions carefully and understand that there's no guarantee it'll help. You may silently lose data following this path.
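
Roughly, the unsafe recovery path with that tool looks like this, run with Elasticsearch stopped on the node in question, and reading every warning the tool prints before confirming:

# on the one surviving master-eligible node chosen to keep its copy of the cluster state
bin/elasticsearch-node unsafe-bootstrap

# on each of the other surviving nodes, before restarting them so they join the newly bootstrapped cluster
bin/elasticsearch-node detach-cluster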


Yes, an unsafe cluster bootstrap followed by a detach-cluster on the other nodes solved it.

Thanks a lot David!

Note:
The only thing I have found missing so far is the ILM policies.
The cluster state was the same on all three nodes, so I just picked one.
