Master not discovered or elected yet, an election requires a node with id [F-Tn-Q6vQuKE0Fgi5qtUMg] + 503 master not discovered exception

I have an Elasticsearch cluster that was working well before. But yesterday I accidentally deleted the master node that had been elected. After this, the other master node cannot be elected by the remaining nodes, and I see the following error in its log file:

master not discovered or elected yet, an election requires a node with id [F-Tn-Q6vQuKE0Fgi5qtUMg], have only discovered non-quorum [{OPT__Master2}{ezEK9jukQTGtVW2DP3cSjA}{tHNwll8bTlmlmNes1HJPJg}{OPT__Master2}{192.168.1.30}{192.168.1.30:9801}{m}];

I wanted to fix this by using the /_cluster/voting_config_exclusions API, but that gave me another error:

503 master not discovered exception
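The request I sent was roughly this (using the node ID from the log message above; the exact call may have differed slightly):

POST /_cluster/voting_config_exclusions?node_ids=F-Tn-Q6vQuKE0Fgi5qtUMg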

What can I do?

See these docs for help:

In particular:

If the logs or the health report indicate that Elasticsearch can’t discover enough nodes to form a quorum, you must address the reasons preventing Elasticsearch from discovering the missing nodes. The missing nodes are needed to reconstruct the cluster metadata. Without the cluster metadata, the data in your cluster is meaningless. The cluster metadata is stored on a subset of the master-eligible nodes in the cluster. If a quorum can’t be discovered, the missing nodes were the ones holding the cluster metadata.

Ensure there are enough nodes running to form a quorum and that every node can communicate with every other node over the network. Elasticsearch will report additional details about network connectivity if the election problems persist for more than a few minutes. If you can’t start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot.
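If it comes to that, the restore path is roughly: register the repository on the new cluster, check which snapshots it contains, and restore the most recent one. A minimal sketch, with placeholder repository, snapshot and path names:

PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es-repo"
  }
}

GET _snapshot/my_backup/_all

POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices": "*"
}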

1 Like

How many master-eligible nodes did your cluster initially have? Did you follow the guidelines in the documentation when setting up the cluster? Which version of Elasticsearch are you using?

1 Like

I can't get back (recover) the missing node or the cluster metadata on it; I have tried many methods to recover it from the hard disk, but all failed.
Now, following your advice, I checked the repository directory I had saved, and found that there were no snapshot files. There were only some indices and UUID directories in each repo, like below:

(base) [root@hm-194 es-repo]# tree -d -L 3
.
├── backups
│   ├── market_subjects
│   │   └── indices
│   ├── patents
│   │   └── indices
│   └── SLRC
│       ├── indices
│       └── tests-yC7oYRGiRgS_uK20H-A7ug
└── long_term_backups
    └── old_market_subjects_data
        └── indices

11 directories

Here, "market_subjects", "patents" and "old_market_subjects_data" are the repo names that I had created before, but "SLRC" is the cluster name.
I didn't find any snapshot files, yet I had previously seen the list of snapshots generated by my snapshot policy. So is the data under the repo path supposed to look like the above, or is something missing?
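In case it matters, this is the kind of check I would expect to run once a cluster has one of these repositories registered (readonly here is just a precaution, and the location is relative to path.repo):

PUT _snapshot/market_subjects
{
  "type": "fs",
  "settings": {
    "location": "backups/market_subjects",
    "readonly": true
  }
}

POST _snapshot/market_subjects/_verify

GET _snapshot/market_subjects/_all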

I think this issue answers Christian's questions: you only had two master-eligible nodes. But as the docs say:

A resilient cluster needs three master-eligible nodes so that if one of them fails then the remaining two still form a majority and can hold a successful election.

Unfortunately if you can't bring the node back online you'll need to build a new cluster and restore your data from backups.
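For future reference, a minimal sketch of what that looks like in elasticsearch.yml on three dedicated master-eligible nodes (names and addresses here are placeholders):

node.name: master-1                 # master-2 / master-3 on the other two nodes
node.roles: [ master ]
discovery.seed_hosts: ["master-1", "master-2", "master-3"]
# only used the very first time the cluster forms, then remove it:
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]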

2 Likes

If you have multiple nodes, Elasticsearch snapshots require a shared filesystem repository, e.g. NFS. Is this what you have configured and mounted on your nodes as the repo path?

1 Like

I had six data nodes and two master nodes before, but yesterday I accidentally deleted one data node and one master node. All nodes are version 8.3.3.

I was originally wondering whether there was any way to force the election of my other master node, but now the answer is obvious: the other master node does not have the cluster metadata, so this is not feasible.

Now I just want to confirm one thing: can I use the snapshot repo path I showed you above to recover data? Even though there is no specific snapshot file, there is a lot of data stored in the indices folders. Can I use it to restore my indexes on a newly created cluster? If so, how? I looked through the official Elastic documentation and it says only snapshots can be used to restore, but in my attempts I created a new cluster with this repo path configured, and the snapshot list in Kibana is empty.

Yes, I had configured it in ./config/elasticsearch.yml on every node (master / data).
[Screenshot 2024-03-22 15.13.19: the repo path configuration in elasticsearch.yml]
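The setting I mean is path.repo; on my nodes it looks roughly like this (the exact path here is an example):

path.repo: ["/home/es-repo"]    # set to the same path on every node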

@DavidTurner This is unfortunately not an uncommon issue even though it is covered by the docs. As far as I know there are no scenarios where bootstrapping a new cluster with 2 master-eligible nodes is recommended. Given that Elasticsearch already adds a number of bootstrap checks for what is deemed to be a production cluster, would verifying that the number of initial master nodes is not 2 be a good candidate for a new bootstrap check? If we wanted to still allow this for some reason, might it be suitable to have the user explicitly enable an "unsafe operation mode" through a configuration setting? It might also be useful to log a warning if there is only a single master-eligible node in a cluster that is not a single-node cluster.

1 Like

What type of storage are you using for the repo? Is it an NFS mount?

LVM, but the LVM volume didn't have any snapshots, so I tried ext4magic and that failed too.

(base) [root@hm-194 home]# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0    60T  0 disk 
├─sda1            8:1    0     1M  0 part 
├─sda2            8:2    0     1G  0 part /boot
└─sda3            8:3    0    60T  0 part 
  ├─centos-root 253:0    0    50G  0 lvm  /
  ├─centos-swap 253:1    0     4G  0 lvm  [SWAP]
  └─centos-home 253:2    0    60T  0 lvm  /home
nvme0n1         259:0    0 931.5G  0 disk /ssd1
nvme1n1         259:1    0 931.5G  0 disk /ssd2
(base) [root@hm-194 home]# 

That does not look like a shared filesystem. Is it?

For Elasticsearch snapshots to work they need shared storage, e.g. NFS storage accessible by all nodes over the network, so that files written by one node can be read from the repo by all other nodes. Having local directories under the same path on different machines does not work.
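As a rough sketch of what that means in practice, assuming an NFS server called nfs-server exporting /exports/es-repo (all names and paths here are examples):

# on every Elasticsearch node, mount the same NFS export at the same path
mount -t nfs nfs-server:/exports/es-repo /mnt/es-repo

# or persistently in /etc/fstab:
# nfs-server:/exports/es-repo  /mnt/es-repo  nfs  defaults  0 0

# then point every node's elasticsearch.yml at that mount:
# path.repo: ["/mnt/es-repo"]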

1 Like

Oh no, I really didn't know this was the case before. Is there really no other way to recover this data?

Anyway, thank you ( @DavidTurner , @Christian_Dahlqvist ) very much for taking the time to help me figure it out.

Bootstrap checks run far too early in startup to know how many nodes there are in the cluster unfortunately, but you're right, it'd be good to have something in this area. I think these days we could add this check to the health report so I opened #106640 to suggest that.

At bootstrap time you would know how cluster.initial_master_nodes is configured though, so you could check that this is not exactly 2 nodes. This would likely be enough to catch a significant number of cases early on. You could still add it to the health report, but I am afraid that is likely to be overlooked, just as the docs often are.

1 Like

Yeah but cluster.initial_master_nodes should only be set the first time the cluster starts, and often folks will get to a 2-node cluster by growing from a one-node cluster, so I don't think this will catch enough cases.

1 Like

There is the elasticsearch-node tool that may be able to help you, but I have never had to use it myself.
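The subcommands that look relevant here are roughly these (both must be run while the node itself is shut down):

./bin/elasticsearch-node unsafe-bootstrap   # on one surviving master-eligible node, forms a new cluster from its local metadata
./bin/elasticsearch-node detach-cluster     # on each other node, so it can join the newly formed cluster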

1 Like

I really appreciate your help; I have fixed my problem the way you suggested.
Although some shards could not be recovered due to my own mistakes ( I did not set the number of replicas, snapshot settings and repo configuration correctly ), it has already saved my life.

To help other people who run into the same problem, I'll record my steps here:

  1. First, I shut down all nodes in my catastrophically damaged cluster.

  2. Secondly, I picked the data node that occupied the largest storage space in my damaged cluster, modified its configuration to change its role to [master,data], and used the elasticsearch-node tool to perform unsafe cluster bootstrapping. This gave me a new cluster, but the node's data and ID were unchanged.
    In fact, some data can already be recovered at this point, but it is still incomplete; how much depends on your replicas and sharding.

  3. Immediately afterwards, bring up the remaining nodes. More accurately, you migrate each node from the damaged cluster into the new cluster, which requires the detach-cluster operation.
    First execute the ./bin/elasticsearch-node detach-cluster command, then modify the discovery settings in the configuration to point to the master node bootstrapped in step 2, and then start the node. (A rough command sketch follows at the end of this post.)

  • I hope everyone can get some help from this like I did :innocent:
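Roughly, the sequence of commands and settings was as follows (the host name is just a placeholder; adapt it to the node you bootstrapped):

# step 1: stop every node in the broken cluster

# step 2: on the chosen data node, edit elasticsearch.yml so it is master-eligible:
#   node.roles: [ master, data ]
# then, with the node still stopped, bootstrap a new one-node cluster from its local metadata:
./bin/elasticsearch-node unsafe-bootstrap
# start this node; it now forms the new cluster

# step 3: on each remaining (stopped) node, detach it from the old cluster:
./bin/elasticsearch-node detach-cluster
# point discovery at the new master in elasticsearch.yml, e.g.:
#   discovery.seed_hosts: ["new-master-host:9300"]
# then start the node so it joins the new cluster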