Cluster has 3 master nodes: I typically bring up 1 as the 'seed' node and then add two others to point to it. Once the nodes have all agreed on who is the boss I kill the 'seed' node and scale up the others to 3.
This morning I wanted to update the config so the non-seed nodes would use persistent storage. The 'seed' master was still in charge, so I dropped / restarted the others. Now the 'seed' master won't acknowledge the new wannabe-masters. Errors show that it's looking (in vain) for the original wannabes.
Seems (to me) that there should be a timeout value set on the master to stop looking for other lost masters and permit new entrants. Either that or create a PUT command to do the same.
Note: after 30+ minutes of waiting, the errors on the seed-master show that it has found the wannabes, but because they aren't in the list of nodes it's expecting, it won't let them join.
Further: messages from the 'wannabes' show that they have found the seed-master but it isn't showing up as eligible.
and this node must discover master-eligible nodes [<seed master>] to bootstrap a cluster:
have discovered <list of nodes including other wannabes _and_ the seed master>...
sigh - now I have to delete the seed-master and rebuild the cluster... the largest hassle being that all the data/ingest nodes are still pointed at the old (broken) master, so the _state folders all have to be tracked down and removed.
Unfortunately this would be unsafe, i.e. would lead to data loss. The only safe thing to do after losing a majority of master nodes is to remain unavailable indefinitely (or until those nodes reappear).
In fact the whole process you describe sounds pretty unsafe. Master-eligible nodes are required to use persistent storage. If they do not then this is the sort of thing that can happen.
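For what it's worth, a minimal sketch of what a master-eligible node's config might look like with persistent storage (the node name and mount paths here are hypothetical, and this uses the pre-7.9 `node.master` syntax; newer versions would use `node.roles` instead):

```yaml
# elasticsearch.yml for a dedicated master-eligible node (illustrative only)
node.name: master-1                  # hypothetical node name
node.master: true                    # master-eligible
node.data: false                     # no data role
node.ingest: false                   # no ingest role
path.data: /mnt/es-persistent/data   # must survive container/VM restarts
path.logs: /mnt/es-persistent/logs
```

The key point is that `path.data` sits on storage that outlives a restart; a master that comes back with an empty data path is, as far as the cluster is concerned, a brand-new node.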
RE: _state folder - I can understand why a data node would find a cluster and stick with it, but I'm not clear on why this value is written to disk. It means that when I rebuild a cluster I have to find / remove this folder, otherwise when the data nodes come up they try to (re)find the old cluster-master.
If your master nodes had storage that persists across restarts then none of this would be necessary. Since you've fixed the storage on the masters, you need to rebuild this cluster one last time and then everything should become much simpler.
Every time I go through this I learn / encounter something new.
Today's entertainment: ingest nodes are failing to come up, claiming they can't find the master node. (5 data nodes are up and the 3 master nodes have agreed with one another.)
Interesting side note: I had to bring down a data node to try and clear a stuck kibana_2 alias. When it came back up it started giving the same error msg as the ingest node(s).
I'm thinking I'm going to have to kill the whole cluster again and keep the seed-master up until all the data and ingest nodes are up. Seems like this will potentially give me fits down the road if I have to restart any of the data/ingest nodes.
It's definitely based on the absence of a seed-master. Once I (re)started that (and subsequently reinitialized the whole cluster) everything resolved and joined properly.
This seems like a bug (to me). Once the seed + non-seed masters have formed a coalition, nodes (ingest, data, etc.) should be able to join regardless of which master node is actually in charge.
Don't believe me? Try it.
1. Create a seed-master node
2. Create 2 (or more) non-seed master nodes and join them to the seed-master
3. Create a non-master node and join it to the cluster - note that it works
4. Kill the seed-master
5. Create a non-master node and (attempt to) join it to the cluster - note that it doesn't work
The only way I can think of to reproduce this is by misconfiguring discovery.seed_hosts. You have mentioned this idea of a "seed" node multiple times but you are using this word quite differently from how Elasticsearch uses it. discovery.seed_hosts should refer to all of your master-eligible nodes, but I suspect you are configuring this setting to refer only to the single node that gets killed in step 4. Once that node is dead, the node created in step 5 cannot find the rest of the cluster because it doesn't know any of their addresses.
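In other words, assuming three master-eligible nodes (the names and addresses below are hypothetical, not taken from this thread), every node in the cluster - data and ingest nodes included - would carry discovery settings along these lines, rather than pointing at a single "seed":

```yaml
# elasticsearch.yml discovery settings (illustrative 7.x syntax, hypothetical hosts)
discovery.seed_hosts:                   # list ALL master-eligible nodes, not just one
  - master-1.example.internal:9300
  - master-2.example.internal:9300
  - master-3.example.internal:9300

# Only consulted the very first time the cluster bootstraps;
# belongs on the master-eligible nodes and can be removed afterwards.
cluster.initial_master_nodes:
  - master-1
  - master-2
  - master-3
```

With every node seeded with all three master addresses, losing any one master no longer prevents new nodes from finding and joining the cluster.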