Long time no see, but hello again.
We had a 3-node cluster running very well until it stopped running very well. The SSDs of two nodes failed within a span of 4 hours. This happened in the middle of the night, so no action could be taken. We were then left with two corrupted nodes and one node still up. We thought everything was fine because we had three nodes and every node held a replica of every index.
The problem was that after a restart this node could not elect itself as master and could not start back up. No matter what we tried, we were unable to get it running. In the end we resorted to unsafe cluster bootstrapping to recover at least some of the data.
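For reference, the unsafe recovery we ended up doing looks roughly like this. This is a sketch, not a recommendation; paths assume a default archive install, and any node that previously belonged to the old cluster must be detached before it can rejoin the new one:

```shell
# On the surviving node, with Elasticsearch STOPPED.
# This discards cluster metadata held by the lost master quorum and may lose data.
bin/elasticsearch-node unsafe-bootstrap

# Restart the node; it now forms a new single-node cluster.

# Any other (wiped/rebuilt) node that still carries old cluster state must be
# detached before it can join the new cluster:
bin/elasticsearch-node detach-cluster
```

The tool prints a confirmation warning before doing anything destructive, which is why it is a last resort rather than a standard procedure.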
But this can't be the "best" way. So what is the actual best practice when 2 of 3 nodes fail? What is the best way to get the data back and the cluster up again? Or is everyone just hoping that this never happens?
Thanks for your ideas
Was there any correlation between the failures? E.g. were they running in physical proximity so they might have suffered similar damage due to heat or vibration or some other environmental effect? Were they the same model of drive? The same manufacturing batch perhaps?
There's no watertight protection against multiple failures so it really depends how paranoid you want to be. Physically separating the nodes helps. Decorrelating their hardware helps. RAID helps, especially on the master nodes - dedicated masters don't need much storage but they do need it to be reliable. But ultimately it's all about probabilities and you can still be extraordinarily unlucky. In that case, the manual recommends restoring from a recent snapshot:
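On the dedicated-masters point: a minimal sketch of the relevant `elasticsearch.yml` settings, assuming a cluster named `my-cluster` with three hypothetical master-eligible hosts (no `<test>` needed, this is a config fragment):

```shell
# elasticsearch.yml on a dedicated master node:
# cluster.name: my-cluster
# node.roles: [ master ]          # master-eligible only, no data role
# path.data: /var/lib/elasticsearch   # small but reliable storage, e.g. RAID-1
#
# Listed on every node so a quorum (2 of 3) can be formed after full restarts:
# discovery.seed_hosts: [ "master-1", "master-2", "master-3" ]
```

With three master-eligible nodes, the cluster tolerates the loss of any one of them; losing two still leaves you without a quorum, which is exactly the situation described above.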
If you can’t start enough nodes to form a quorum, start a new cluster and restore data from a recent snapshot.
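The restore-from-snapshot path the manual describes can be sketched with the snapshot APIs. The repository name `my_backup`, the filesystem location, and the snapshot name `snapshot_1` are placeholders for whatever your backup setup uses:

```shell
# Re-register the snapshot repository on the freshly started cluster.
curl -X PUT "localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/backups/my_backup" }
}'

# Restore all indices from a recent snapshot. Leaving out the cluster-wide
# state avoids clobbering the new cluster's settings.
curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "*",
  "include_global_state": false
}'
```

The important part is that snapshots must already exist before the failure; with regular snapshots (e.g. via SLM policies), losing a quorum becomes an inconvenience rather than a data-loss event.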