Hi all, thanks in advance for taking the time to read through this.
I am running an Elasticsearch cluster on a Docker swarm on a bare-metal server, and I have gotten into a state where a master node is never elected:
{
"type": "server",
"timestamp": "2022-03-16T21:44:23,314Z",
"level": "WARN",
"component": "o.e.c.c.ClusterFormationFailureHelper",
"cluster.name": "es-docker-cluster",
"node.name": "es01",
"message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [-P0HmPrNQo2OILIFebfaiQ, iHssXU2PQ1mu0h8kVEApsA, rcPqEpLzQrOKbOdIy5rtXA], have only discovered non-quorum [{es01}{-P0HmPrNQo2OILIFebfaiQ}{YG1huot6SiWMFUETrJFkYA}{10.0.7.12}{10.0.7.12:9300}{cdfhilmrstw}, {es02}{pfKwaVGFRPC32I_1VdWaqA}{1oraGkXRQrGt774ShU8qUw}{10.0.7.6}{10.0.7.6:9300}{cdfhilmrstw}, {es09}{ALFQG_ufTJWErofgyJgKVA}{6WY7pFdxTiitQSv-cYn_cw}{10.0.7.20}{10.0.7.20:9300}{cdfhilmrstw}, {es10}{r0WdgFU9TSSELTO0iTolMw}{yX1gg8JNSfutUfMhH6h_2Q}{10.0.7.21}{10.0.7.21:9300}{cdfhilmrstw}]; discovery will continue using [10.0.7.6:9300, 10.0.7.20:9300, 10.0.7.21:9300] from hosts providers and [{es01}{-P0HmPrNQo2OILIFebfaiQ}{YG1huot6SiWMFUETrJFkYA}{10.0.7.12}{10.0.7.12:9300}{cdfhilmrstw}] from last-known cluster state; node term 1347, last-accepted version 287263 in term 1347"
}
The short story is that the master nodes the cluster is waiting for no longer exist: they went down when a RAM stick failed on the server.
Mistakes I made:
- I took the cluster down without first calling the voting_config_exclusions API (see the sketch after this list).
- I had misconfigured my cluster so that the master nodes were also data nodes.
I found out later that these mistakes led to the predicament I now find myself in.
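My understanding now is that the exclusion call needs to happen while the cluster is still healthy, before the old masters are stopped; something like the following, where the node names are hypothetical (at this point I only know the old node IDs):

POST /_cluster/voting_config_exclusions?node_names=es03,es04

and then, once the remaining masters have taken over, clearing the list with DELETE /_cluster/voting_config_exclusions.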
Here's what I've tried:
POST /_cluster/voting_config_exclusions?node_ids=iHssXU2PQ1mu0h8kVEApsA,rcPqEpLzQrOKbOdIy5rtXA
- This didn't work: the request needs an elected master to be processed, and there isn't one.
I tried fiddling with node.master, node.voting_only, and node.data in the docker-stack.yml (a sketch of what I was changing is below), but it had no effect.
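For context, a minimal sketch of the kind of settings I was toggling in the stack file; the image tag and seed hosts are placeholders for my real values (I'm aware recent 7.x prefers node.roles over these legacy flags):

services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0   # placeholder tag
    environment:
      - cluster.name=es-docker-cluster
      - node.name=es01
      - node.master=true        # legacy flag I toggled; node.roles supersedes it
      - node.voting_only=false
      - node.data=true
      - discovery.seed_hosts=es02,es09,es10   # placeholder list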
I tried removing the volume associated with -P0HmPrNQo2OILIFebfaiQ (the only master-eligible node found before) and starting es01 on a fresh volume, as described here (kubernetes - Elasticsearch 7.2.0: master not discovered or elected yet, an election requires at least X nodes - Stack Overflow); the volume swap itself is sketched below.
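Concretely, the swap was just pointing es01 at a new, empty named volume in the stack file; a minimal sketch, with hypothetical volume names (mine differ):

services:
  es01:
    volumes:
      - es01_data_fresh:/usr/share/elasticsearch/data   # was es01_data

volumes:
  es01_data_fresh:   # new empty volume standing in for the old one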
- This also ended up not working; initially the cluster showed these logs:
{
"type": "server",
"timestamp": "2022-03-17T19:26:45,098Z",
"level": "WARN",
"component": "o.e.c.c.ClusterFormationFailureHelper",
"cluster.name": "es-docker-cluster",
"node.name": "es01",
"message": "master not discovered or elected yet, an election requires 3 nodes with ids [6impsuPPQiC1PtSgixyOMQ, r0WdgFU9TSSELTO0iTolMw, mkSI0yRYQR-A0nf8zkkFHg], have discovered possible quorum [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{4zms7RjESe6TA1FqR6flMA}{10.0.16.5}{10.0.16.5:9300}{cdfhilmrstw}, {es10}{r0WdgFU9TSSELTO0iTolMw}{hAi6b_07Qh6_xVe2-n6kVg}{10.0.16.4}{10.0.16.4:9300}{cdfhilmrstvw}, {es02}{6impsuPPQiC1PtSgixyOMQ}{wUpjCjdSTK2R8LFJMBfDUQ}{10.0.16.7}{10.0.16.7:9300}{cdfhilmrstvw}, {es09}{ALFQG_ufTJWErofgyJgKVA}{yFtFItIFSBiVKfFaDP1oVA}{10.0.16.2}{10.0.16.2:9300}{cdfhilmrstvw}]; discovery will continue using [10.0.16.7:9300, 10.0.16.2:9300, 10.0.16.4:9300] from hosts providers and [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{4zms7RjESe6TA1FqR6flMA}{10.0.16.5}{10.0.16.5:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0"
}
I figured this might be a matter of the data needing to be copied back over to es01, since it is on a fresh volume. Well, more than 12 hours later, here's the state:
{
"type": "server",
"timestamp": "2022-03-18T13:16:41,071Z",
"level": "WARN",
"component": "o.e.c.c.ClusterFormationFailureHelper",
"cluster.name": "es-docker-cluster",
"node.name": "es01",
"message": "master not discovered or elected yet, an election requires 3 nodes with ids [6impsuPPQiC1PtSgixyOMQ, r0WdgFU9TSSELTO0iTolMw, mkSI0yRYQR-A0nf8zkkFHg], have only discovered non-quorum [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{tRFA5C5MTkuO3UTcfm3EVg}{10.0.17.4}{10.0.17.4:9300}{cdfhilmrstw}, {es02}{WteiVwQ0SVqnztTdDm62dw}{PqXu1DrFQAiCVc1nl1DDhg}{10.0.17.6}{10.0.17.6:9300}{cdfhilmrstvw}, {es10}{r0WdgFU9TSSELTO0iTolMw}{SZUPH4TDRlalJnvaxPIvxg}{10.0.17.7}{10.0.17.7:9300}{cdfhilmrstvw}, {es09}{ALFQG_ufTJWErofgyJgKVA}{-PTYvKF9QHOHbuGkecGg5Q}{10.0.17.10}{10.0.17.10:9300}{cdfhilmrstvw}]; discovery will continue using [10.0.17.6:9300, 10.0.17.10:9300, 10.0.17.7:9300] from hosts providers and [{es01}{mkSI0yRYQR-A0nf8zkkFHg}{tRFA5C5MTkuO3UTcfm3EVg}{10.0.17.4}{10.0.17.4:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0"
}
es02 seems to have changed its ID from 6impsuPPQiC1PtSgixyOMQ to WteiVwQ0SVqnztTdDm62dw, and now I am at a loss for how to proceed to get back to a usable state.
Appreciate any help/feedback, thanks