Can't restart two-master cluster

Hi, I need a little bit of help here.

I have a cluster with two master-eligible nodes (yeah, now I know) and did the following:

  • stopped elasticsearch on both nodes
  • moved a copy of an unimportant index's directory to a backup location
  • deleted the index directory from both nodes
  • started elasticsearch

At this point I got an error that the nodes cannot start because they know of an index that no longer exists. I then restored the index on one machine and was able to start that node, but the cluster still failed to form because both machines are required (recently migrated to 7.1).

So I copied the index to the other machine, thinking it would start with the copy from the first node. It did not. And since I had deleted the folder on that node, I can't start it.

So this is the pickle: one node won't start because it refuses to run without the index, and the cluster won't form because that second node is down. It's a circle.

Next, I started the second machine with an empty data dir, thinking it would sync once the cluster was live. But somehow both nodes wait for a node ID that doesn't exist, even though they can see each other:

[2019-07-29T10:17:47,020][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ES1] master not discovered or elected yet, an election requires two nodes with ids [mgFa73WPSVSg3edz5Zmfqg, oloM9SZBRtSRBXjMV4uLFA], have discovered [{ES2}{oloM9SZBRtSRBXjMV4uLFA}{2mEpzKqFQB2PFvVwE6K8SA}{192.168.3.41}{192.168.3.41:9300}{ml.machine_memory=135144009728, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.3.41:9300] from hosts providers and [{ES1}{mgFa73WPSVSg3edz5Zmfqg}{r61aRxtBT3id96EkDDi-3A}{192.168.3.40}{192.168.3.40:9300}{ml.machine_memory=135144009728, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

[2019-07-29T10:18:12,197][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ES2] master not discovered or elected yet, an election requires a node with id [DwJ0DZvNTAatJvpFK_Otqw], have discovered [{ES1}{mgFa73WPSVSg3edz5Zmfqg}{r61aRxtBT3id96EkDDi-3A}{192.168.3.40}{192.168.3.40:9300}{ml.machine_memory=135144009728, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.3.40:9300] from hosts providers and [{ES2}{oloM9SZBRtSRBXjMV4uLFA}{2mEpzKqFQB2PFvVwE6K8SA}{192.168.3.41}{192.168.3.41:9300}{ml.machine_memory=135144009728, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 7, last-accepted version 648 in term 7

Any idea how to solve this? There has to be a way to recover from a missing index if you have a copy of it.

It looks to me like this node, ES1, has discovered all the nodes it needs to hold an election: it is mgFa73WPSVSg3edz5Zmfqg and it has discovered ES2 with ID oloM9SZBRtSRBXjMV4uLFA. The "which is not a quorum" message is a reporting bug, fixed in 7.3.0 (#43316).

It is also perhaps telling that this node reports node term 0, last-accepted version 0 in term 0, which means it has never fully joined a cluster.

Where did the node with this ID go? Node IDs are stored persistently in the data path so, put differently, where did the data path for this node go?
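As an aside, while a cluster is up you can see each node's persistent ID via the cat API. The endpoint and parameters are real; the host here is just an example:

```shell
# List node IDs, names and IPs; full_id=true shows the complete persistent ID
# (the same ID that is stored in the node's data path).
curl -s 'localhost:9200/_cat/nodes?v&h=id,name,ip&full_id=true'
```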

What do you mean you copied the index to the other machine? Can you share the exact API calls you used?

Wait, what? On first reading I didn't see this. You manually deleted stuff from out of the data path?

Yeah. Actual folder delete. With rm -rf.
Bottom line: I think it's not normal for index loss or corruption to render a node useless with no obvious/quick/easy way to recover.

I ended up using elasticsearch-node unsafe-bootstrap on the "good" node and reinstalling Elasticsearch from scratch on the other.
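For anyone landing here later, the commands involved look roughly like this. Both are real subcommands of the elasticsearch-node tool, both must be run while Elasticsearch is stopped on that node, and both can lose data, so they are a last resort:

```shell
# On the surviving "good" node: force it to bootstrap a new one-node cluster
# from its local copy of the cluster state.
bin/elasticsearch-node unsafe-bootstrap

# On a node being salvaged instead of reinstalled: detach it from the old
# cluster metadata so it can join the newly bootstrapped cluster.
bin/elasticsearch-node detach-cluster
```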

What I would like to know for the future:

  • how to recover a node with a faulty / missing index, if possible.
  • how to remove a node from a cluster when the cluster doesn't start.
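On the second point, a sketch of the two situations (node name and host are illustrative; the query-parameter form of the exclusions API is the newer one, while early 7.x put the node name in the URL path):

```shell
# While the cluster is still healthy: take the node out of the voting
# configuration first, then stop it.
curl -X POST 'localhost:9200/_cluster/voting_config_exclusions?node_names=ES2'

# Once the node is gone for good, clear the exclusion list.
curl -X DELETE 'localhost:9200/_cluster/voting_config_exclusions'
```

If the cluster cannot start at all, there is no API to call, and the offline elasticsearch-node tool (unsafe-bootstrap / detach-cluster) is the only remaining option.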

That is because of the file path change.

Can you clarify why you think this? Elasticsearch offers absolutely no guarantees if you manually alter the contents of the data directory. Please do not ever do this. If there's some misleading docs that are giving you the impression that this is a reasonable thing to do then please could you share a link so we can fix them?

Fault tolerance is generally expected to happen at the node level: a node fails, you write it off completely, build a new empty node to replace it and let that new node recover its contents from the redundant copies of the data elsewhere in the cluster. Moreover you need at least three master-eligible nodes before you can have any expectation of fault tolerance: recovery from a fault in a two-node cluster is generally expected to happen by building a whole new cluster and restoring data from snapshots.
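The arithmetic behind that: an election requires a majority of the master-eligible nodes, i.e. floor(n/2) + 1 of them. A tiny sketch:

```shell
# Majority quorum of n master-eligible nodes.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 2   # prints 2 -- both nodes are required, so losing either halts the cluster
quorum 3   # prints 2 -- one node can fail and a majority still remains
```

This is why two master-eligible nodes give no fault tolerance at all: the quorum equals the whole set.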

Elasticsearch has a certain amount of tolerance for certain common hardware failure modes, allowing it to handle these more gracefully than simply replacing the whole node. Deletion of a whole index directory is not remotely such a failure mode.


I guess I'm making a parallel with similar systems. Database servers don't break when a db or a table or replication breaks.

The main reason I did this is because backup is so damn complicated in Elasticsearch, and this method worked in earlier versions.
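For completeness, the supported backup path is the snapshot API. A minimal filesystem-repository example (repository and index names are illustrative, and the location must be listed under path.repo in elasticsearch.yml on every node):

```shell
# 1. Register a shared-filesystem snapshot repository.
curl -X PUT 'localhost:9200/_snapshot/my_backup' \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "/mnt/backups/my_backup" } }'

# 2. Snapshot just the one index and wait for it to finish.
curl -X PUT 'localhost:9200/_snapshot/my_backup/snap-1?wait_for_completion=true' \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "my-unimportant-index" }'

# 3. Later, restore it (delete or close the live index of the same name first).
curl -X POST 'localhost:9200/_snapshot/my_backup/snap-1/_restore' \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "my-unimportant-index" }'
```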