Can't restart two-master cluster

Hi, I need a little bit of help here.

I have a cluster with two master-eligible nodes (yeah, now I know) and did the following:

  • stopped elasticsearch on both nodes
  • moved a copy of an unimportant index's directory to a backup location
  • deleted the index directory from both nodes
  • started elasticsearch

At this point I got an error that the nodes cannot start because they know of an index that no longer exists. I then restored the index on one machine and was able to start that node, but the cluster still failed to form because both machines are required (recently migrated to 7.1).

So I copied the index to the other machine, thinking it would start with the copy from the first node. It did not. And since I had deleted the folder on that node, I can't start it.

So this is the pickle: one node won't start because it refuses to run without the index, and the cluster won't form because that second node is down. It's a circle.

Next, I started the second machine with an empty data dir, thinking it would sync once the cluster was live. But somehow both nodes wait for a node ID that doesn't exist, even though they can see each other:

[2019-07-29T10:17:47,020][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ES1] master not discovered or elected yet, an election requires two nodes with ids [mgFa73WPSVSg3edz5Zmfqg, oloM9SZBRtSRBXjMV4uLFA], have discovered [{ES2}{oloM9SZBRtSRBXjMV4uLFA}{2mEpzKqFQB2PFvVwE6K8SA}{192.168.3.41}{192.168.3.41:9300}{ml.machine_memory=135144009728, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.3.41:9300] from hosts providers and [{ES1}{mgFa73WPSVSg3edz5Zmfqg}{r61aRxtBT3id96EkDDi-3A}{192.168.3.40}{192.168.3.40:9300}{ml.machine_memory=135144009728, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0

[2019-07-29T10:18:12,197][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ES2] master not discovered or elected yet, an election requires a node with id [DwJ0DZvNTAatJvpFK_Otqw], have discovered [{ES1}{mgFa73WPSVSg3edz5Zmfqg}{r61aRxtBT3id96EkDDi-3A}{192.168.3.40}{192.168.3.40:9300}{ml.machine_memory=135144009728, ml.max_open_jobs=20, xpack.installed=true}] which is not a quorum; discovery will continue using [192.168.3.40:9300] from hosts providers and [{ES2}{oloM9SZBRtSRBXjMV4uLFA}{2mEpzKqFQB2PFvVwE6K8SA}{192.168.3.41}{192.168.3.41:9300}{ml.machine_memory=135144009728, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 7, last-accepted version 648 in term 7

Any idea how to solve this? There has to be a way to recover from a missing index if you have a copy of it.

It looks to me like this node, ES1, has discovered all the nodes it needs to hold an election: it is mgFa73WPSVSg3edz5Zmfqg and it has discovered ES2 with ID oloM9SZBRtSRBXjMV4uLFA. The "which is not a quorum" message is a reporting bug, fixed in 7.3.0 (#43316).

It is also perhaps telling that this node reports node term 0, last-accepted version 0 in term 0, which means it has never fully joined a cluster.

Where did the node with this ID go? Node IDs are stored persistently in the data path so, put differently, where did the data path for this node go?
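As an aside, while a cluster is up you can see each node's persistent ID via the cat API. The endpoint and parameters are real; the host here is just an example:

```shell
# List node IDs, names and IPs; full_id=true shows the complete persistent ID
# (the same ID that is stored in the node's data path).
curl -s 'localhost:9200/_cat/nodes?v&h=id,name,ip&full_id=true'
```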

What do you mean you copied the index to the other machine? Can you share the exact API calls you used?

Wait, what? On first reading I didn't see this. You manually deleted stuff from out of the data path?

Yeah. Actual folder delete. With rm -rf.
Bottom line: I think it's not normal for index loss or corruption to render a node useless with no obvious/quick/easy way to recover.

I ended up using elasticsearch-node unsafe-bootstrap on the "good" node and reinstalling Elasticsearch from scratch on the other.
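For anyone landing here later, the commands involved look roughly like this. Both are real subcommands of the elasticsearch-node tool, both must be run while Elasticsearch is stopped on that node, and both can lose data, so they are a last resort:

```shell
# On the surviving "good" node: force it to bootstrap a new one-node cluster
# from its local copy of the cluster state.
bin/elasticsearch-node unsafe-bootstrap

# On a node being salvaged instead of reinstalled: detach it from the old
# cluster metadata so it can join the newly bootstrapped cluster.
bin/elasticsearch-node detach-cluster
```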

What I would like to know for the future:

  • how to recover a node with a faulty / missing index, if possible.
  • how to remove a node from a cluster when the cluster doesn't start.
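On the second point, a sketch of the two situations (node name and host are illustrative; the query-parameter form of the exclusions API is the newer one, while early 7.x put the node name in the URL path):

```shell
# While the cluster is still healthy: take the node out of the voting
# configuration first, then stop it.
curl -X POST 'localhost:9200/_cluster/voting_config_exclusions?node_names=ES2'

# Once the node is gone for good, clear the exclusion list.
curl -X DELETE 'localhost:9200/_cluster/voting_config_exclusions'
```

If the cluster cannot start at all, there is no API to call, and the offline elasticsearch-node tool (unsafe-bootstrap / detach-cluster) is the only remaining option.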

That is because of the file path change.

Can you clarify why you think this? Elasticsearch offers absolutely no guarantees if you manually alter the contents of the data directory. Please do not ever do this. If there's some misleading docs that are giving you the impression that this is a reasonable thing to do then please could you share a link so we can fix them?

Fault tolerance is generally expected to happen at the node level: a node fails, you write it off completely, build a new empty node to replace it and let that new node recover its contents from the redundant copies of the data elsewhere in the cluster. Moreover you need at least three master-eligible nodes before you can have any expectation of fault tolerance: recovery from a fault in a two-node cluster is generally expected to happen by building a whole new cluster and restoring data from snapshots.
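The arithmetic behind that: an election requires a majority of the master-eligible nodes, i.e. floor(n/2) + 1 of them. A tiny sketch:

```shell
# Majority quorum of n master-eligible nodes.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 2   # prints 2 -- both nodes are required, so losing either halts the cluster
quorum 3   # prints 2 -- one node can fail and a majority still remains
```

This is why two master-eligible nodes give no fault tolerance at all: the quorum equals the whole set.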

Elasticsearch has a certain amount of tolerance for certain common hardware failure modes, allowing it to handle these more gracefully than simply replacing the whole node. Deletion of a whole index directory is not remotely such a failure mode.


I guess I'm making a parallel with similar systems. Database servers don't break when a db or a table or replication breaks.

The main reason I did this is because backup is so damn complicated in Elasticsearch, and this method worked in earlier versions.
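For completeness, the supported backup path is the snapshot API. A minimal filesystem-repository example (repository and index names are illustrative, and the location must be listed under path.repo in elasticsearch.yml on every node):

```shell
# 1. Register a shared-filesystem snapshot repository.
curl -X PUT 'localhost:9200/_snapshot/my_backup' \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "/mnt/backups/my_backup" } }'

# 2. Snapshot just the one index and wait for it to finish.
curl -X PUT 'localhost:9200/_snapshot/my_backup/snap-1?wait_for_completion=true' \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "my-unimportant-index" }'

# 3. Later, restore it (delete or close the live index of the same name first).
curl -X POST 'localhost:9200/_snapshot/my_backup/snap-1/_restore' \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "my-unimportant-index" }'
```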