Issue with master nodes being deleted and no new master being elected

Hi, all, thanks in advance for taking the time to read through this:

I am running an Elasticsearch cluster on Docker Swarm on a bare-metal server, and I have gotten into a state where a master node is never elected:

{
    "type": "server",
    "timestamp": "2022-03-16T21:44:23,314Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "es-docker-cluster",
    "node.name": "es01",
    "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [-P0HmPrNQo2OILIFebfaiQ, iHssXU2PQ1mu0h8kVEApsA, rcPqEpLzQrOKbOdIy5rtXA], have only discovered non-quorum [{es01}{-P0HmPrNQo2OILIFebfaiQ}{YG1huot6SiWMFUETrJFkYA}{10.0.7.12}{10.0.7.12:9300}{cdfhilmrstw}, {es02}{pfKwaVGFRPC32I_1VdWaqA}{1oraGkXRQrGt774ShU8qUw}{10.0.7.6}{10.0.7.6:9300}{cdfhilmrstw}, {es09}{ALFQG_ufTJWErofgyJgKVA}{6WY7pFdxTiitQSv-cYn_cw}{10.0.7.20}{10.0.7.20:9300}{cdfhilmrstw}, {es10}{r0WdgFU9TSSELTO0iTolMw}{yX1gg8JNSfutUfMhH6h_2Q}{10.0.7.21}{10.0.7.21:9300}{cdfhilmrstw}]; discovery will continue using [10.0.7.6:9300, 10.0.7.20:9300, 10.0.7.21:9300] from hosts providers and [{es01}{-P0HmPrNQo2OILIFebfaiQ}{YG1huot6SiWMFUETrJFkYA}{10.0.7.12}{10.0.7.12:9300}{cdfhilmrstw}] from last-known cluster state; node term 1347, last-accepted version 287263 in term 1347"
}
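
(Side note for anyone debugging the same thing: with no elected master, cluster-level APIs time out, while node-local APIs still respond. A quick way to confirm the cluster has no master, as a sketch in the same request style as below:)

# Times out and then returns a 503 master_not_discovered_exception
# for as long as no master is elected:
GET /_cluster/health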

The short story here is that the master nodes the cluster is waiting for no longer exist: they went down when a RAM stick failed on the server.

Mistakes I made:

  • I took the cluster down without first calling the voting_config_exclusions API (see the sketch after this list).
  • I had misconfigured my cluster so that the master-eligible nodes were also data nodes.
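
For context, the decommissioning procedure I should have followed looks roughly like this (a sketch; the node name is a hypothetical placeholder, not one of my real nodes):

# 1. Before stopping a master-eligible node, exclude it from
#    voting configurations and wait for the call to return:
POST /_cluster/voting_config_exclusions?node_names=es-master-1

# 2. Stop the node once it no longer holds a vote.

# 3. After the resize is complete, clear the exclusions list:
DELETE /_cluster/voting_config_exclusions?wait_for_removal=false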

I found out later that these mistakes led to the predicament I now find myself in.

Here's what I've tried:

POST /_cluster/voting_config_exclusions?node_ids=iHssXU2PQ1mu0h8kVEApsA,rcPqEpLzQrOKbOdIy5rtXA
  • this didn't work because no master node is elected; the voting exclusions API itself requires an elected master to process the request

I tried fiddling with node.master, node.voting_only, and node.data in the docker-stack.yml (see the sketch below), but it had no effect.
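
For reference, the settings I was toggling live in the environment section of each service in docker-stack.yml. A minimal sketch, not my actual file (the image tag and service layout are illustrative):

es01:
  image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
  environment:
    - node.name=es01
    - cluster.name=es-docker-cluster
    # Legacy role flags (deprecated in later 7.x releases in favour
    # of node.roles). Changing these has no effect on an election
    # that is already stuck, because the last-committed voting
    # configuration is persisted in the cluster metadata on disk.
    - node.master=true
    - node.data=false
    - node.voting_only=false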

I tried removing the volume associated with -P0HmPrNQo2OILIFebfaiQ (the only master-eligible node found before) and using a fresh volume, as described here: kubernetes - Elasticsearch 7.2.0: master not discovered or elected yet, an election requires at least X nodes - Stack Overflow

  • this also ended up not working; initially the cluster showed these logs:
{
    "type": "server",
    "timestamp": "2022-03-17T19:26:45,098Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "es-docker-cluster",
    "node.name": "es01",
    "message": "master not discovered or elected yet, an election requires 3 nodes with ids [6impsuPPQiC1PtSgixyOMQ, r0WdgFU9TSSELTO0iTolMw, mkSI0yRYQR-A0nf8zkkFHg], have discovered possible quorum [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{4zms7RjESe6TA1FqR6flMA}{10.0.16.5}{10.0.16.5:9300}{cdfhilmrstw}, {es10}{r0WdgFU9TSSELTO0iTolMw}{hAi6b_07Qh6_xVe2-n6kVg}{10.0.16.4}{10.0.16.4:9300}{cdfhilmrstvw}, {es02}{6impsuPPQiC1PtSgixyOMQ}{wUpjCjdSTK2R8LFJMBfDUQ}{10.0.16.7}{10.0.16.7:9300}{cdfhilmrstvw}, {es09}{ALFQG_ufTJWErofgyJgKVA}{yFtFItIFSBiVKfFaDP1oVA}{10.0.16.2}{10.0.16.2:9300}{cdfhilmrstvw}]; discovery will continue using [10.0.16.7:9300, 10.0.16.2:9300, 10.0.16.4:9300] from hosts providers and [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{4zms7RjESe6TA1FqR6flMA}{10.0.16.5}{10.0.16.5:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0"
}

I figured this might be an issue of the data needing to be copied back over to es01 because it has a fresh volume (note the "node term 0, last-accepted version 0" in the log: es01 is starting from a brand-new, empty cluster state). Well, more than 12 hours later, here's the state:

{
    "type": "server",
    "timestamp": "2022-03-18T13:16:41,071Z",
    "level": "WARN",
    "component": "o.e.c.c.ClusterFormationFailureHelper",
    "cluster.name": "es-docker-cluster",
    "node.name": "es01",
    "message": "master not discovered or elected yet, an election requires 3 nodes with ids [6impsuPPQiC1PtSgixyOMQ, r0WdgFU9TSSELTO0iTolMw, mkSI0yRYQR-A0nf8zkkFHg], have only discovered non-quorum [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{tRFA5C5MTkuO3UTcfm3EVg}{10.0.17.4}{10.0.17.4:9300}{cdfhilmrstw}, {es02}{WteiVwQ0SVqnztTdDm62dw}{PqXu1DrFQAiCVc1nl1DDhg}{10.0.17.6}{10.0.17.6:9300}{cdfhilmrstvw}, {es10}{r0WdgFU9TSSELTO0iTolMw}{SZUPH4TDRlalJnvaxPIvxg}{10.0.17.7}{10.0.17.7:9300}{cdfhilmrstvw}, {es09}{ALFQG_ufTJWErofgyJgKVA}{-PTYvKF9QHOHbuGkecGg5Q}{10.0.17.10}{10.0.17.10:9300}{cdfhilmrstvw}]; discovery will continue using [10.0.17.6:9300, 10.0.17.10:9300, 10.0.17.7:9300] from hosts providers and [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{tRFA5C5MTkuO3UTcfm3EVg}{10.0.17.4}{10.0.17.4:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0"
}

es02 seems to have changed its ID from 6impsuPPQiC1PtSgixyOMQ to WteiVwQ0SVqnztTdDm62dw, and now I am at a loss for how to proceed to get back to a usable state.

Appreciate any help/feedback, thanks

The cluster metadata is stored on a majority of the master-eligible nodes; in this case that seems to be the nodes with IDs iHssXU2PQ1mu0h8kVEApsA and rcPqEpLzQrOKbOdIy5rtXA, both of which are no longer available. You'll need to bring at least one of them back in order to continue using this cluster. Without the cluster metadata, the data in the cluster is unfortunately meaningless. If you cannot restore one of these two nodes, you will have to recover your cluster from a recent snapshot.
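
For example, checking a previously registered repository and restoring from it would look something like this (the repository and snapshot names are placeholders, and this assumes a repository was registered before the failure):

# List the snapshots available in a registered repository:
GET /_snapshot/my_backup/_all

# After rebuilding the cluster, restore indices and the global
# cluster state from the chosen snapshot:
POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "*",
  "include_global_state": true
}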


Thanks, I guess I'm out of luck on this one. I have been using the same nodes and the same volumes, but they're showing different IDs since I brought the cluster up after swapping out the volume for es01.

Out of curiosity, is there a reason why the IDs for a node would change?

Appreciate it

Elasticsearch only makes a new node ID if you start it on an empty data path, which effectively means it's not the same node any more.
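
If you want to check a node's ID directly, the nodes info API is node-local, so it responds even without an elected master (a sketch):

# The keys under "nodes" in the response are the persistent node
# IDs; they only change when a node starts on an empty data path.
GET /_nodes/_local?filter_path=nodes.*.name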

