Master node discovery problem

Hello,

My cluster has 3 master nodes, but during deployment two of them fail to join the cluster. Their logs say "an election requires at least 2 nodes with ids", but only one node is discovered, so it is "not a quorum".
Then I tried to manually start a master node (on 10.158.112.146), hoping it would form a quorum with the surviving master, 113.17, using:
-e "cluster.initial_cluster_manager_nodes=10.158.113.17:9300" -e "discovery.seed_hosts=10.158.113.17:9300", but then I got:
"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [10.158.113.17:9300] to bootstrap a cluster: have discovered [{6f54aa8a47a5}{MBKRu7fWSe6XDSYDINANnw}{QEZgA7hvQruL0-KDijy6qA}{10.158.112.146}{10.158.112.146:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}, {93024391fa66}{ZcoJRj24R0u7HR41UewE0Q}{OGZjQoOaToKPkshdWz3IeA}{10.158.113.17}{10.158.113.17:9300}{m}{aws_availability_zone=us-east-1d, shard_indexing_pressure_enabled=true}, ...".

What confuses me is that it has already discovered "10.158.113.17:9300", which is exactly the node listed under "must discover master-eligible nodes" - so why does it still fail to discover it?

I pass the parameters as Docker environment variables; they are:

    -e "bootstrap.memory_lock=true" \
    -e "cloud.node.auto_attributes=true" \
    -e "cluster.name=mycluster" \
    -e "cluster.routing.allocation.awareness.attributes=aws_availability_zone" \
    -e "discovery.ec2.tag.SearchName=mytag" \
    -e "discovery.seed_providers=ec2" \
    -e "network.publish_host=_ec2:privateIp_" \
    -e "node.roles=master" \
    -e "plugin.mandatory=discovery-ec2" \
    -e "cluster.initial_master_nodes=10.158.113.17:9300" \
    -e "discovery.seed_hosts=10.158.113.17:9300" \

Full discovery log:
master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [10.158.113.17:9300] to bootstrap a cluster: have discovered [{6f54aa8a47a5}{MBKRu7fWSe6XDSYDINANnw}{QEZgA7hvQruL0-KDijy6qA}{10.158.112.146}{10.158.112.146:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}, {93024391fa66}{ZcoJRj24R0u7HR41UewE0Q}{OGZjQoOaToKPkshdWz3IeA}{10.158.113.17}{10.158.113.17:9300}{m}{aws_availability_zone=us-east-1d, shard_indexing_pressure_enabled=true}, {86733a8c6fae}{sFt9J5-9QUWIJGsaJNbrrw}{0uFourUbTsqrBO-DvXEuzw}{10.158.112.30}{10.158.112.30:9300}{m}{aws_availability_zone=us-east-1a, shard_indexing_pressure_enabled=true}, {f57c1dfb9b32}{B8td7hNZSTi91LpU9nlhBA}{OdwtvZCJTEebqd33y3vsNg}{10.158.112.201}{10.158.112.201:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}]; discovery will continue using [10.158.113.17:9300, 10.158.112.146:9300, 10.158.112.165:9300, 10.158.112.30:9300, 10.158.113.18:9300, 10.158.113.37:9300, 10.158.112.201:9300, 10.158.113.17:9300, 10.158.112.100:9300] from hosts providers and [{6f54aa8a47a5}{MBKRu7fWSe6XDSYDINANnw}{QEZgA7hvQruL0-KDijy6qA}{10.158.112.146}{10.158.112.146:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

Thank you.

I found that the only surviving master, 113.17, is also failing to form a cluster because it is waiting for other node(s) that are already gone.
I couldn't find a way to dynamically change "cluster.initial_master_nodes" and "discovery.seed_hosts" to let it form a cluster without the other nodes.
So I can neither start a new node to join the cluster nor make the existing master elect itself.
I finally stopped this master and restarted it with fresh initial-node settings; a new cluster UUID was created, and I'm now recovering from a snapshot.
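For reference, the restore step looked roughly like this (the repository name `my_repo` and snapshot name `my_snapshot` are placeholders for our actual ones):

```shell
# Restore all indices from the snapshot into the freshly bootstrapped
# cluster, along with the cluster-wide settings in the global state.
# "my_repo" and "my_snapshot" are placeholder names.
curl -X POST "localhost:9200/_snapshot/my_repo/my_snapshot/_restore" \
  -H 'Content-Type: application/json' \
  -d '{
        "indices": "*",
        "include_global_state": true
      }'
```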

Still, if anyone has an idea how to fix this discovery issue, I would much appreciate it. Thanks.

That's about the best you can do if you can't bring enough of the old master-eligible nodes back to health. See these docs for more information:

If the logs or the health report indicate that Elasticsearch can’t discover enough nodes to form a quorum, you must address the reasons preventing Elasticsearch from discovering the missing nodes. The missing nodes are needed to reconstruct the cluster metadata. Without the cluster metadata, the data in your cluster is meaningless. The cluster metadata is stored on a subset of the master-eligible nodes in the cluster. If a quorum can’t be discovered, the missing nodes were the ones holding the cluster metadata. [...] If you can’t start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot.
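As an aside, regarding the confusing "must discover master-eligible nodes" message even though the address was discovered: if I remember the docs correctly, entries in `cluster.initial_master_nodes` are matched against each node's `node.name` (which defaults to the hostname, i.e. the container ID like `93024391fa66` in your log), and an address entry only matches when it is exactly the node's publish address, which can fail in surprising ways behind Docker. Setting explicit node names and listing those names tends to be more robust. A sketch (the names `master-1` etc. are made up):

```shell
-e "node.name=master-1" \
-e "cluster.initial_master_nodes=master-1,master-2,master-3" \
```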

Thanks for the response.

So if I understand it correctly:

  1. When more than 50% of the master nodes (2 out of 3) go down (EC2 shutdown), the whole cluster goes down and needs recovery from a snapshot?
     In that case, wouldn't it be better to use 5 masters than 3, since losing 3 nodes at the same time is less likely than losing 2?
  2. Even though the data nodes still hold the data on local disk from before the cluster went down, we have to recover from a snapshot instead of reusing the data on disk?
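My understanding of the quorum arithmetic, sketched quickly (quorum = floor(n/2) + 1, so tolerated failures = n - quorum):

```shell
# Quorum size and tolerated master failures for n master-eligible nodes.
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "masters=$n quorum=$quorum tolerated_failures=$tolerated"
done
# masters=3 -> quorum=2, tolerated_failures=1
# masters=5 -> quorum=3, tolerated_failures=2
```

So 3 masters tolerate only 1 permanent failure, which is why losing 2 of 3 took the cluster down, and 5 masters would tolerate 2.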

thanks again.

  1. EC2 failures that cause the permanent loss of a node should be pretty rare. Losing two at once seems incredibly unlikely. We don't run any production workloads with more than 3 masters.

  2. That's correct, data nodes hold data, but not metadata. Without the metadata, the data is meaningless.

This time it happened while we were doing a rolling update of the auto-scaling group, so maybe one node went down accidentally while another was being updated.
Thanks for the help.

Ah right, yeah. EC2 auto-scaling groups only really work for stateless services; you can't really get them to work reliably for this purpose.