Long time no see, but hello again.
We had a 3-node cluster running very well until it stopped running very well. The SSDs of two nodes failed within a span of 4 hours. This happened in the middle of the night, so no action could be taken. We were then left with two corrupted nodes and one node still up. We thought everything was fine because we had three nodes and every node held a replica of every index.
The problem was that after a restart this node could not elect itself as master and could not start back up. No matter what we tried, we were unable to get it running. In the end we resorted to unsafe cluster bootstrapping to recover at least some of the data.
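For reference, the unsafe recovery we ended up doing looks roughly like this. This is a sketch, not a recommendation; paths assume a default archive install, and any node that previously belonged to the old cluster must be detached before it can rejoin the new one:

```shell
# On the surviving node, with Elasticsearch STOPPED.
# This discards cluster metadata held by the lost master quorum and may lose data.
bin/elasticsearch-node unsafe-bootstrap

# Restart the node; it now forms a new single-node cluster.

# Any other (wiped/rebuilt) node that still carries old cluster state must be
# detached before it can join the new cluster:
bin/elasticsearch-node detach-cluster
```

The tool prints a confirmation warning before doing anything destructive, which is why it is a last resort rather than a standard procedure.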
But this can't be the "best" way. So what is the actual best practice when 2 of 3 nodes fail? What is the best way to get the data back and the cluster up again? Or is everyone just hoping that this never happens?
Thanks for your ideas
Was there any correlation between the failures? E.g. were they running in physical proximity so they might have suffered similar damage due to heat or vibration or some other environmental effect? Were they the same model of drive? The same manufacturing batch perhaps?
There's no watertight protection against multiple failures so it really depends how paranoid you want to be. Physically separating the nodes helps. Decorrelating their hardware helps. RAID helps, especially on the master nodes - dedicated masters don't need much storage but they do need it to be reliable. But ultimately it's all about probabilities and you can still be extraordinarily unlucky. In that case, the manual recommends restoring from a recent snapshot:
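On the dedicated-masters point: a minimal sketch of the relevant `elasticsearch.yml` settings, assuming a cluster named `my-cluster` with three hypothetical master-eligible hosts (no `<test>` needed, this is a config fragment):

```shell
# elasticsearch.yml on a dedicated master node:
# cluster.name: my-cluster
# node.roles: [ master ]          # master-eligible only, no data role
# path.data: /var/lib/elasticsearch   # small but reliable storage, e.g. RAID-1
#
# Listed on every node so a quorum (2 of 3) can be formed after full restarts:
# discovery.seed_hosts: [ "master-1", "master-2", "master-3" ]
```

With three master-eligible nodes, the cluster tolerates the loss of any one of them; losing two still leaves you without a quorum, which is exactly the situation described above.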
If you can’t start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot.
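The restore-from-snapshot path the manual describes can be sketched with the snapshot APIs. The repository name `my_backup`, the filesystem location, and the snapshot name `snapshot_1` are placeholders for whatever your backup setup uses:

```shell
# Re-register the snapshot repository on the freshly started cluster.
curl -X PUT "localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/backups/my_backup" }
}'

# Restore all indices from a recent snapshot. Leaving out the cluster-wide
# state avoids clobbering the new cluster's settings.
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "*",
  "include_global_state": false
}'
```

The important part is that snapshots must already exist before the failure; with regular snapshots (e.g. via SLM policies), losing a quorum becomes an inconvenience rather than a data-loss event.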