Unable to form cluster after half of the cluster nodes were removed - ES 7.1

Hi,

We had six nodes (all master-eligible) running in our cluster, and then three nodes were removed, volumes gone. The three nodes that were removed had been excluded from shard allocation with the setting:

"cluster.routing.allocation.exclude._ip"

and each held only one shard, belonging to the _security index.
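
For context, that kind of exclusion is normally applied through the cluster settings API; a sketch of the call (the IPs are placeholders for the removed nodes' addresses, which aren't listed in this thread):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "<removed-node-ip-1>,<removed-node-ip-2>,<removed-node-ip-3>"
  }
}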

After the nodes were removed, this failure arose (from the log):

elasticsearch security index is unavailable short circuiting retrieval of user

Settings for cluster discovery were as follows (legacy from v6):

discovery.zen.ping.unicast.hosts: "192.168.50.80:9300, 192.168.50.81:9300, 192.168.50.83:9300"
discovery.zen.minimum_master_nodes: 2
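
(For reference, in 7.x these zen settings are superseded by discovery.seed_hosts, together with cluster.initial_master_nodes when bootstrapping a brand-new cluster; the 7.x equivalent of the hosts list above would be something like:)

discovery.seed_hosts: ["192.168.50.80:9300", "192.168.50.81:9300", "192.168.50.83:9300"]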

My issue now is that the cluster won't elect a master since the requirements aren't fulfilled. From the log on node es-1:

[<time..>][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-1] master not discovered or elected yet, an election requires at least 3 nodes with ids from [2IE4RVpNTfKpL5JQsbvPCQ, 4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, DuIW_Q65QFeZWOzwcyKlXA, TxpUfaPDTrSCngTNuZ_Brg] and at least 2 nodes with ids from [4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, TxpUfaPDTrSCngTNuZ_Brg], have discovered [{es-2}{TxpUfaPDTrSCngTNuZ_Brg}{X0wUWTMrTLm1tDE2JMHZ2A}{192.168.50.81}{192.168.50.81:9300}{xpack.installed=true}, {es-4}{8DDEaYnnQKGkK9ea2klnnw}{EtBJ1BycTBCTIANy3Pp_MA}{192.168.50.83}{192.168.50.83:9300}{xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.50.81:9300, 192.168.50.83:9300] from hosts providers and [{es-1}{Vmau018eQWO3AjMzSuo8sQ}{3Ri3q7cvQEGV-p11ykEkCA}{192.168.50.80}{192.168.50.80:9300}{xpack.installed=true}] from last-known cluster state; node term 1087, last-accepted version 125480 in term 11

Same for node es-2:

[<time...>][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-2] master not discovered or elected yet, an election requires at least 3 nodes with ids from [2IE4RVpNTfKpL5JQsbvPCQ, 4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, DuIW_Q65QFeZWOzwcyKlXA, TxpUfaPDTrSCngTNuZ_Brg] and at least 2 nodes with ids from [4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, TxpUfaPDTrSCngTNuZ_Brg], have discovered [{es-1}{Vmau018eQWO3AjMzSuo8sQ}{3Ri3q7cvQEGV-p11ykEkCA}{192.168.50.80}{192.168.50.80:9300}{xpack.installed=true}, {es-4}{8DDEaYnnQKGkK9ea2klnnw}{EtBJ1BycTBCTIANy3Pp_MA}{192.168.50.83}{192.168.50.83:9300}{xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.50.80:9300, 192.168.50.83:9300] from hosts providers and [{es-2}{TxpUfaPDTrSCngTNuZ_Brg}{X0wUWTMrTLm1tDE2JMHZ2A}{192.168.50.81}{192.168.50.81:9300}{xpack.installed=true}] from last-known cluster state; node term 1087, last-accepted version 125480 in term 11

and es-4:

[<time..>][WARN ][o.e.c.c.ClusterFormationFailureHelper] [es-4] master not discovered or elected yet, an election requires at least 3 nodes with ids from [2IE4RVpNTfKpL5JQsbvPCQ, 4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, DuIW_Q65QFeZWOzwcyKlXA, TxpUfaPDTrSCngTNuZ_Brg] and at least 2 nodes with ids from [4womI-u8TMS_lxDytQ0kGg, 8DDEaYnnQKGkK9ea2klnnw, TxpUfaPDTrSCngTNuZ_Brg], have discovered [{es-1}{Vmau018eQWO3AjMzSuo8sQ}{3Ri3q7cvQEGV-p11ykEkCA}{192.168.50.80}{192.168.50.80:9300}{xpack.installed=true}, {es-2}{TxpUfaPDTrSCngTNuZ_Brg}{X0wUWTMrTLm1tDE2JMHZ2A}{192.168.50.81}{192.168.50.81:9300}{xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.50.80:9300, 192.168.50.81:9300] from hosts providers and [{es-4}{8DDEaYnnQKGkK9ea2klnnw}{EtBJ1BycTBCTIANy3Pp_MA}{192.168.50.83}{192.168.50.83:9300}{xpack.installed=true}] from last-known cluster state; node term 1087, last-accepted version 125480 in term 11

I don't understand why the election requires three nodes. Is the version 6 minimum_master_nodes setting not used at all?
If so, was the requirement of three nodes also enforced when the cluster had six nodes?

I have tried adding the setting

cluster.initial_master_nodes: [es-1, es-2]

But that doesn't make a difference.

Is it possible to reset cluster formation settings/requirements?

It seems like it is the ID for node es-1,

Vmau018eQWO3AjMzSuo8sQ

that doesn't correspond to any of the required IDs, which is why the requirement of 3 nodes fails.

How did you remove the 3 nodes from your cluster?
Are you sure you changed the settings on all your nodes, or did you kill them all before updating the settings?

You're providing too many settings. cluster.initial_master_nodes is a setting from version 7.x of Elasticsearch. It seems like you're using version 6.x.

The instances for the 3 nodes were removed/wiped.
The setting "cluster.routing.allocation.exclude._ip" included all of the IPs corresponding to the 3 nodes that were wiped. All shards except the _security shard had been removed from them.
The cluster is version 7.1, upgraded from version 6, which is why some old settings have been in use.
I will remove the cluster.initial_master_nodes setting; it was only added to try and force the cluster to settle with the available nodes.

It (minimum_master_nodes) is used briefly during a rolling upgrade, but otherwise it's ignored, yes.

The docs on removing master nodes say:

... if you shut down half or more of the master-eligible nodes all at the same time then the cluster will normally become unavailable. [...]
As long as there are at least three master-eligible nodes in the cluster, as a general rule it is best to remove nodes one-at-a-time ...

This is because it is possible that none of the remaining nodes holds the latest cluster state. The only safe path forward is to bring one or more of the missing nodes back online. The next best thing is to restore the cluster from a snapshot.
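
For anyone finding this thread later: the supported 7.x procedure for taking half or more of the master-eligible nodes out of service is to add them to the voting configuration exclusions first and wait for the cluster to auto-reconfigure before shutting them down. A sketch, with a placeholder node name and the API form used by early 7.x releases:

POST /_cluster/voting_config_exclusions/<node-name-being-removed>

followed, once the removed nodes are gone for good, by:

DELETE /_cluster/voting_config_exclusions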

Right, so that condition was fulfilled since half of the master-eligible nodes were wiped.

What are the unsafe paths? I'm thinking more in terms of resetting the cluster formation settings/requirements, for example changing the required node IDs to include the ID for node es-1.

Restoring from a snapshot is unsafe if you've added data since your last snapshot, because unfortunately that data won't be restored. Hopefully it's possible to ingest it again?
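
If the snapshot route were viable, a restore that also brings back the cluster metadata would look roughly like this; the repository and snapshot names are placeholders:

POST /_snapshot/<repository>/<snapshot>/_restore
{
  "include_global_state": true
}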

I'm afraid too much data has been added since the latest snapshot was taken. I'm very interested in forcing the nodes to accept each other.

I know that the wiped nodes "only" contained shards from the _security index, so all other data indices should be intact.

Is there any possibility to achieve this?

That might be true, but the trouble is that the metadata that keeps these indices readable and consistent and so on is kept on the master nodes, and you may not have a valid copy of this metadata any more.

It's possible that the elasticsearch-node tool can do what you need, although please read the instructions carefully and understand that there's no guarantee it'll help. You may silently lose data following this path.
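
Roughly, the unsafe recovery path with that tool looks like this, run with Elasticsearch stopped on the node in question, and reading every warning the tool prints before confirming:

# on the one surviving master-eligible node chosen to keep its copy of the cluster state
bin/elasticsearch-node unsafe-bootstrap

# on each of the other surviving nodes, before restarting them so they join the newly bootstrapped cluster
bin/elasticsearch-node detach-cluster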


Yes, an unsafe cluster bootstrap followed by a detach-cluster on the other nodes solved it.

Thanks a lot David!

Note:
The only thing I have found missing so far is the ILM policies.
The cluster state was the same on all three nodes, so I just picked one.
