Master node discovery problem

Hello,

My cluster has 3 master nodes, but during deployment two of them fail to join the cluster. Their logs say "an election requires at least 2 nodes with ids", but only one node is discovered, so it is "not a quorum".
Then I tried to manually start a master node (on 10.158.112.146), hoping it would form a quorum with the surviving master, 113.17, using:
-e "cluster.initial_cluster_manager_nodes=10.158.113.17:9300" -e "discovery.seed_hosts=10.158.113.17:9300", but then I got:
"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [10.158.113.17:9300] to bootstrap a cluster: have discovered [{6f54aa8a47a5}{MBKRu7fWSe6XDSYDINANnw}{QEZgA7hvQruL0-KDijy6qA}{10.158.112.146}{10.158.112.146:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}, {93024391fa66}{ZcoJRj24R0u7HR41UewE0Q}{OGZjQoOaToKPkshdWz3IeA}{10.158.113.17}{10.158.113.17:9300}{m}{aws_availability_zone=us-east-1d, shard_indexing_pressure_enabled=true}, ...".

What confuses me is that it has already discovered "10.158.113.17:9300", which is exactly the node listed under "must discover master-eligible nodes" - so why does it still fail to discover it?

I pass the parameters as Docker environment variables; they are:

    -e "bootstrap.memory_lock=true" \
    -e "cloud.node.auto_attributes=true" \
    -e "cluster.name=mycluster" \
    -e "cluster.routing.allocation.awareness.attributes=aws_availability_zone" \
    -e "discovery.ec2.tag.SearchName=mytag" \
    -e "discovery.seed_providers=ec2" \
    -e "network.publish_host=_ec2:privateIp_" \
    -e "node.roles=master" \
    -e "plugin.mandatory=discovery-ec2" \
    -e "cluster.initial_master_nodes=10.158.113.17:9300" \
    -e "discovery.seed_hosts=10.158.113.17:9300" \

Full discovery log:
master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [10.158.113.17:9300] to bootstrap a cluster: have discovered [{6f54aa8a47a5}{MBKRu7fWSe6XDSYDINANnw}{QEZgA7hvQruL0-KDijy6qA}{10.158.112.146}{10.158.112.146:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}, {93024391fa66}{ZcoJRj24R0u7HR41UewE0Q}{OGZjQoOaToKPkshdWz3IeA}{10.158.113.17}{10.158.113.17:9300}{m}{aws_availability_zone=us-east-1d, shard_indexing_pressure_enabled=true}, {86733a8c6fae}{sFt9J5-9QUWIJGsaJNbrrw}{0uFourUbTsqrBO-DvXEuzw}{10.158.112.30}{10.158.112.30:9300}{m}{aws_availability_zone=us-east-1a, shard_indexing_pressure_enabled=true}, {f57c1dfb9b32}{B8td7hNZSTi91LpU9nlhBA}{OdwtvZCJTEebqd33y3vsNg}{10.158.112.201}{10.158.112.201:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}]; discovery will continue using [10.158.113.17:9300, 10.158.112.146:9300, 10.158.112.165:9300, 10.158.112.30:9300, 10.158.113.18:9300, 10.158.113.37:9300, 10.158.112.201:9300, 10.158.113.17:9300, 10.158.112.100:9300] from hosts providers and [{6f54aa8a47a5}{MBKRu7fWSe6XDSYDINANnw}{QEZgA7hvQruL0-KDijy6qA}{10.158.112.146}{10.158.112.146:9300}{m}{aws_availability_zone=us-east-1b, shard_indexing_pressure_enabled=true}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

Thank you.

I found that the only surviving master, 113.17, is also failing to form a cluster because it is waiting for other node(s) that are already gone.
I couldn't find a way to dynamically change "cluster.initial_master_nodes" and "discovery.seed_hosts" to let it form a cluster without the other nodes.
So I can neither start a new node to join the cluster nor make the existing master elect itself.
I finally stopped this master and restarted it with fresh initial-node settings; a new cluster UUID was created, and I'm now recovering from a snapshot.
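For reference, the restore step looked roughly like this (the repository name `my_repo` and snapshot name `my_snapshot` are placeholders for our actual ones):

```shell
# Restore all indices from the snapshot into the freshly bootstrapped
# cluster, along with the cluster-wide settings in the global state.
# "my_repo" and "my_snapshot" are placeholder names.
curl -X POST "localhost:9200/_snapshot/my_repo/my_snapshot/_restore" \
  -H 'Content-Type: application/json' \
  -d '{
        "indices": "*",
        "include_global_state": true
      }'
```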

Still, if anyone has an idea how to fix this discovery issue, I would much appreciate it. Thanks.

That's about the best you can do if you can't bring enough of the old master-eligible nodes back to health. See these docs for more information:

If the logs or the health report indicate that Elasticsearch can’t discover enough nodes to form a quorum, you must address the reasons preventing Elasticsearch from discovering the missing nodes. The missing nodes are needed to reconstruct the cluster metadata. Without the cluster metadata, the data in your cluster is meaningless. The cluster metadata is stored on a subset of the master-eligible nodes in the cluster. If a quorum can’t be discovered, the missing nodes were the ones holding the cluster metadata. [...] If you can’t start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot.
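As an aside, regarding the confusing "must discover master-eligible nodes" message even though the address was discovered: if I remember the docs correctly, entries in `cluster.initial_master_nodes` are matched against each node's `node.name` (which defaults to the hostname, i.e. the container ID like `93024391fa66` in your log), and an address entry only matches when it is exactly the node's publish address, which can fail in surprising ways behind Docker. Setting explicit node names and listing those names tends to be more robust. A sketch (the names `master-1` etc. are made up):

```shell
-e "node.name=master-1" \
-e "cluster.initial_master_nodes=master-1,master-2,master-3" \
```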

Thanks for the response.

So if I understand it correctly:

  1. When more than 50% of the master nodes (2 out of 3) go down (EC2 shutdown), the whole cluster goes down and needs recovery from a snapshot?
     In that case, wouldn't it be better to use 5 masters than 3, since losing 3 nodes at the same time is less likely than losing 2?
  2. Even though the data nodes still hold the data on local disk from before the cluster went down, we have to recover from a snapshot instead of reusing the data on disk?
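My understanding of the quorum arithmetic, sketched quickly (quorum = floor(n/2) + 1, so tolerated failures = n - quorum):

```shell
# Quorum size and tolerated master failures for n master-eligible nodes.
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "masters=$n quorum=$quorum tolerated_failures=$tolerated"
done
# masters=3 -> quorum=2, tolerated_failures=1
# masters=5 -> quorum=3, tolerated_failures=2
```

So 3 masters tolerate only 1 permanent failure, which is why losing 2 of 3 took the cluster down, and 5 masters would tolerate 2.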

thanks again.

  1. EC2 failures that cause the permanent loss of a node should be pretty rare. Losing two at once seems incredibly unlikely. We don't run any production workloads with more than 3 masters.

  2. That's correct, data nodes hold data, but not metadata. Without the metadata, the data is meaningless.

This time it happened while we were doing a rolling update of the auto-scaling group, so maybe one node went down accidentally while another was being updated.
Thanks for the help.

Ah right, yeah. EC2 auto-scaling groups only really work for stateless services; you can't really get them to work reliably for this purpose.