Need help with a master node issue and UUIDs

hi,

We have a cluster with two master nodes and two data nodes. We had an issue where the master pods showed a "master not discovered exception", so we deleted the PVC of master-1 and the master-1 pod and re-created it. Now master-0 and the data nodes won't join the cluster (different UUIDs):

master-0 has no UUID and can't join the cluster
the data nodes' UUIDs are the same
master-1's UUID is different.

Is it possible to remove master-1, promote master-0 to master, and have it adopt the UUID from the data pods? That would solve our issue.

Thanks.

What do you have in the logs of each node? Please share the logs, and also share your elasticsearch.yml configuration for all nodes.
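In particular, the discovery-related settings are what matter here. Something along these lines is what I'd expect to see on a master-eligible node (the node names and hosts below are just placeholders, adjust them to your setup):

    # elasticsearch.yml on a master-eligible node -- illustrative values only
    cluster.name: my-cluster                          # must be identical on every node
    node.name: master-0
    node.roles: [ master ]
    discovery.seed_hosts: [ "master-0", "master-1" ]  # how the nodes find each other
    # cluster.initial_master_nodes: only set when bootstrapping a brand-new cluster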

Also, you cannot have a resilient cluster with just two master-eligible nodes; a cluster with two master-eligible nodes is basically no more resilient than a cluster with a single master node. [documentation]

From what you described it looks like master-1 was the elected master, and by deleting its PVC you deleted the cluster metadata that was stored on it.
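For future reference, when the cluster is healthy you can check which node is the elected master and which nodes are master-eligible, for example:

    GET _cat/master?v
    GET _cat/nodes?v&h=name,node.role,master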

Do you have snapshots? I'm not sure you can recover from this.


That indicates you removed half of the master nodes in your cluster, which these docs say not to do:

[IMPORTANT]
To be sure that the cluster remains available you must not stop half or more of the nodes in the voting configuration at the same time. As long as more than half of the voting nodes are available the cluster can still work normally. This means that if there are three or four master-eligible nodes, the cluster can tolerate one of them being unavailable. If there are two or fewer master-eligible nodes, they must all remain available.

If you stop half or more of the nodes in the voting configuration at the same time then the cluster will be unavailable until you bring enough nodes back online to form a quorum again. While the cluster is unavailable, any remaining nodes will report in their logs that they cannot discover or elect a master node. See Troubleshooting discovery for more information.
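If you still have an elected master, you can see which nodes are currently in the voting configuration with a filtered cluster state request, for example:

    GET _cluster/state?filter_path=metadata.cluster_coordination.last_committed_config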

See also these docs on safely removing master-eligible nodes:

If there are only two master-eligible nodes remaining then neither node can be safely removed since both are required to reliably make progress. To remove one of these nodes you must first inform Elasticsearch that it should not be part of the voting configuration, and that the voting power should instead be given to the other node. You can then take the excluded node offline without preventing the other node from making progress.
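Concretely, on recent versions the flow for the node being removed looks roughly like this (the node name here is a placeholder):

    # exclude the node you are about to take offline from the voting configuration
    POST _cluster/voting_config_exclusions?node_names=master-1

    # ...shut that node down, then clear the exclusions once you are finished
    DELETE _cluster/voting_config_exclusions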

The linked troubleshooting docs indicate that the missing master held data which was vital to your cluster:

If the logs or the health report indicate that Elasticsearch can’t discover enough nodes to form a quorum, you must address the reasons preventing Elasticsearch from discovering the missing nodes. The missing nodes are needed to reconstruct the cluster metadata. Without the cluster metadata, the data in your cluster is meaningless. The cluster metadata is stored on a subset of the master-eligible nodes in the cluster. If a quorum can’t be discovered, the missing nodes were the ones holding the cluster metadata. [...] If you can’t start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot.
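On recent versions (8.7 and later, I believe) that health report is available through the API, and the master stability indicator is the relevant one here:

    GET _health_report/master_is_stable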

The different cluster UUID on master-1 suggests it bootstrapped a brand-new cluster, which in turn suggests you hadn't removed the cluster.initial_master_nodes setting - see these docs:

[IMPORTANT]
After the cluster has formed, remove the cluster.initial_master_nodes setting from each node’s configuration and never set it again for this cluster. Do not configure this setting on nodes joining an existing cluster. Do not configure this setting on nodes which are restarting. Do not configure this setting when performing a full-cluster restart.

If you leave cluster.initial_master_nodes in place once the cluster has formed then there is a risk that a future misconfiguration may result in bootstrapping a new cluster alongside your existing cluster. It may not be possible to recover from this situation without losing data.
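In other words, something like the following should only ever appear in elasticsearch.yml for the very first start of a brand-new cluster (node names are placeholders), and should be removed as soon as the cluster has formed:

    # only for bootstrapping a brand-new cluster -- remove once the cluster has formed
    cluster.initial_master_nodes: [ "master-0", "master-1" ]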

Unfortunately you're going to have to restore the cluster from a recent snapshot.
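Once a fresh cluster is up and the snapshot repository is registered again, the restore itself is only a couple of API calls, roughly like this (the repository and snapshot names below are placeholders for yours):

    # list the snapshots available in your repository
    GET _cat/snapshots/my_repository?v

    # restore the last good snapshot, including the cluster state
    POST _snapshot/my_repository/my_snapshot/_restore
    {
      "indices": "*",
      "include_global_state": true
    }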


Thanks so much for your reply. I was able to restart the whole cluster and restore a snapshot we took before the issues started.
