Restored data to fresh cluster incorrectly and receiving "master not discovered or elected yet"

Hi

I had a 3-node cluster running on Kubernetes. There were space issues with the volumes, so I was forced to provision new ones. Before I read this article I decided to make a copy of the data directory, then I created a new 3-node cluster and restored the data directory into it. I do of course realise now that this was not the correct way to do this, but I am hoping it is not too late to recover my data.

This is the error I receive now:

{"type": "server", "timestamp": "2020-08-11T17:55:53,439Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "k8s-logs", "node.name": "es-cluster-2", "message": "master not discovered or elected yet, an election requires at least 3 nodes with ids from [61_bqJv4Q1CqUYnXR4SliQ, DxDiXlTCQw2r86hvOgFZTA, CH9ToyI_Siqmu41a8LdecQ, aowKPk47SMeK0k9nph6yoA, 3eWboZMGSoOmM0UA49SsJA], have discovered [{es-cluster-2}{61_bqJv4Q1CqUYnXR4SliQ}{QOai1LR8Rb6n-0umaedE7g}{10.10.2.236}{10.10.2.236:9300}{dilmrt}{ml.machine_memory=2799996928, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {es-cluster-1}{3eWboZMGSoOmM0UA49SsJA}{Bgw_mExBQQiPdM8Oo1g1Hw}{10.10.1.98}{10.10.1.98:9300}{dilmrt}{ml.machine_memory=2799996928, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}, {es-cluster-0}{CH9ToyI_Siqmu41a8LdecQ}{Y8L2EahpTf6tCYHRBe_czQ}{10.10.3.64}{10.10.3.64:9300}{dilmrt}{ml.machine_memory=2799996928, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] which is a quorum; discovery will continue using [10.10.3.64:9300, 10.10.1.98:9300] from hosts providers and [{es-cluster-2}{61_bqJv4Q1CqUYnXR4SliQ}{QOai1LR8Rb6n-0umaedE7g}{10.10.2.236}{10.10.2.236:9300}{dilmrt}{ml.machine_memory=2799996928, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 2923, last-accepted version 146555 in term 2923" }

I read other posts about "master not discovered or elected yet", but I see that the difference between the errors in those posts and mine is that the node IDs are being discovered.

Is there anything I can do to force a master to be elected, or to reset the election process? I still have the data directories, so please can you advise on any other way I can restore my data?

Thanks
Ronen

You seem to have had a 5-node cluster:

an election requires at least 3 nodes with ids from [61_bqJv4Q1CqUYnXR4SliQ, DxDiXlTCQw2r86hvOgFZTA, CH9ToyI_Siqmu41a8LdecQ, aowKPk47SMeK0k9nph6yoA, 3eWboZMGSoOmM0UA49SsJA]

It looks like you've restored at least 3 of them; I imagine they're all logging similar-looking messages but they will be subtly different and the differences are important. Can you share all of these messages?

Thanks for the response. There was never a 5-node cluster; there were only ever 3 persistent volumes, so I'm not sure how it thinks there were 5.

Please see the links below to the logs for each node. I have enabled trace logging. The top line of each file is the main error.

es-cluster-0:

es-cluster-1:

es-cluster-2:

Thanks
Ronen

Thanks, that's helpful. Here's the problem:

an election requires at least 3 nodes with ids from ...
[                        DxDiXlTCQw2r86hvOgFZTA, CH9ToyI_Siqmu41a8LdecQ, aowKPk47SMeK0k9nph6yoA, DDm_M-p1QzStgEBUq4poAg, 3eWboZMGSoOmM0UA49SsJA]
[                        DxDiXlTCQw2r86hvOgFZTA, CH9ToyI_Siqmu41a8LdecQ, aowKPk47SMeK0k9nph6yoA, DDm_M-p1QzStgEBUq4poAg, 3eWboZMGSoOmM0UA49SsJA]
[61_bqJv4Q1CqUYnXR4SliQ, DxDiXlTCQw2r86hvOgFZTA, CH9ToyI_Siqmu41a8LdecQ, aowKPk47SMeK0k9nph6yoA,                         3eWboZMGSoOmM0UA49SsJA]

There are actually 6 different node IDs in play, so for a majority you need at least 4 of them to be present. Technically you only need 3 from each subset of 5 mentioned above, but in practice this means the same thing: you're missing a node, without which Elasticsearch cannot reconstruct the cluster state.
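To spell that out, here is a rough sketch of the same arithmetic, nothing more; it uses only the ID lists quoted above and the IDs of the three nodes that are actually running according to the log message in the first post (variable names are mine):

# Rough sketch: compare the three quoted ID lists against the IDs of the
# three nodes that were actually discovered in the first post's log message.
lists = [
    # first list (no 61_..., but includes DDm_...)
    {"DxDiXlTCQw2r86hvOgFZTA", "CH9ToyI_Siqmu41a8LdecQ", "aowKPk47SMeK0k9nph6yoA",
     "DDm_M-p1QzStgEBUq4poAg", "3eWboZMGSoOmM0UA49SsJA"},
    # second list (identical to the first)
    {"DxDiXlTCQw2r86hvOgFZTA", "CH9ToyI_Siqmu41a8LdecQ", "aowKPk47SMeK0k9nph6yoA",
     "DDm_M-p1QzStgEBUq4poAg", "3eWboZMGSoOmM0UA49SsJA"},
    # third list (includes 61_..., but no DDm_...)
    {"61_bqJv4Q1CqUYnXR4SliQ", "DxDiXlTCQw2r86hvOgFZTA", "CH9ToyI_Siqmu41a8LdecQ",
     "aowKPk47SMeK0k9nph6yoA", "3eWboZMGSoOmM0UA49SsJA"},
]

# es-cluster-0, es-cluster-1 and es-cluster-2, as discovered in the first post
running = {"CH9ToyI_Siqmu41a8LdecQ", "3eWboZMGSoOmM0UA49SsJA", "61_bqJv4Q1CqUYnXR4SliQ"}

all_ids = set().union(*lists)
print(f"{len(all_ids)} distinct IDs ever seen, {len(all_ids & running)} of them running")

for i, ids in enumerate(lists, start=1):
    present = ids & running
    needed = len(ids) // 2 + 1    # a majority of a 5-ID list is 3
    print(f"list {i}: {len(present)}/{needed} required IDs present, "
          f"missing {sorted(ids - running)}")

The third list is the one from the log in your first post, which is why that message says "which is a quorum"; the other two nodes are each missing a third ID from their own lists, so from their point of view there is no quorum.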

These nodes were, at some point in the past, all present in this cluster at the same time. No idea how, sorry, but Elasticsearch doesn't invent these node IDs freely so the only explanation is that you had more nodes than you do now.

OK, thanks, I understand now. I must have done something wrong with the backup/restore.

Now the question is: is there any way to recover the data?

I would try the restore again, in the hope that wherever these extra nodes came from, it happened after you took the backup. Make sure you shut everything down, restore the data paths of all the nodes to their new locations, and only then start things up.
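As a very rough sketch of that ordering only (all paths here are invented, the copy would really happen wherever the new volumes are mounted, and I'm assuming the cluster runs as a StatefulSet):

# Rough sketch, not a tested procedure: restore every node's complete data
# directory while nothing is running, then start the nodes back up.
# All paths below are hypothetical placeholders.
import pathlib
import shutil

backups = {
    "es-cluster-0": "/backup/es-cluster-0/data",   # hypothetical backup locations
    "es-cluster-1": "/backup/es-cluster-1/data",
    "es-cluster-2": "/backup/es-cluster-2/data",
}

# Step 1 (outside this script): scale the StatefulSet to 0 so no node is running.

# Step 2: copy each node's whole data path onto its new volume.
for node, src in backups.items():
    dest = pathlib.Path("/new-volumes") / node / "data"   # hypothetical mount point
    if dest.exists():
        shutil.rmtree(dest)        # the target data path must start out empty
    shutil.copytree(src, dest)     # copy the entire directory, not a subset of it

# Step 3 (outside this script): scale the StatefulSet back up to 3 nodes.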

If these extra nodes are present in the backup then I'm sorry to say that the backup doesn't include the latest cluster state so there's no safe way to recover the cluster.

Thanks for the help David! I appreciate the quick responses!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.