Due to an unfortunate operator error (while dealing with an out-of-disk error on the master nodes), the PVs of the master nodes were deleted.
We had 3 dedicated master nodes and 12 data nodes.
After provisioning new master nodes, the data nodes can no longer locate a master (and yes, the original master node set was sadly scaled down to zero).
We have also thought about making 3 of the data nodes master-eligible and, if possible, triggering a master election, although we aren't sure if this is the right path.
By far the best option would be to restore from a snapshot into a new cluster.
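A minimal sketch of what a restore request against the new cluster could look like (the repository and snapshot names here are hypothetical, and the snapshot repository has to be registered on the new cluster first):

```
POST _snapshot/my_repository/my_snapshot/_restore
{
  "indices": "*",
  "include_global_state": false
}
```

If that is not possible you could try the following: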
Spin up new master nodes (I think you have done that already).
Set cluster.initial_master_nodes explicitly to the node names of the new master nodes (this will generate a warning in the logs, as this setting is usually managed by ECK).
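For example, in the Elasticsearch resource it could look like the following sketch (the cluster name, nodeSet name, and version are hypothetical; adjust them to your manifest):

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart                  # hypothetical cluster name
spec:
  version: 8.17.0                   # use your actual version
  nodeSets:
    - name: master                  # the new dedicated master nodeSet
      count: 3
      config:
        node.roles: ["master"]
        # ECK derives node names from pod names: <cluster>-es-<nodeSet>-<ordinal>
        cluster.initial_master_nodes:
          - quickstart-es-master-0
          - quickstart-es-master-1
          - quickstart-es-master-2
```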
Now for the dangerous part. Please note that the following can lead to data loss and should only be your measure of last resort:
Convince yourself that you have correctly configured cluster.initial_master_nodes and that the new masters have formed a new cluster.
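To check, you can look at the logs of one of the new master pods, or query the API with the elastic user credentials that ECK manages (names below are hypothetical):

```sh
# Look for a successful master election in the logs of a new master pod
kubectl logs quickstart-es-master-0 | grep -i "elected-as-master"

# Or ask the cluster itself which nodes it currently knows about
PASSWORD=$(kubectl get secret quickstart-es-elastic-user \
  -o go-template='{{.data.elastic | base64decode}}')
kubectl exec quickstart-es-master-0 -- \
  curl -sk -u "elastic:$PASSWORD" "https://localhost:9200/_cat/nodes?v"
```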
Now edit the spec of your data nodes and add an init container that detaches the data nodes from the old cluster and allows them to join the newly formed cluster. Again, this is where you can lose data. In an experiment I used an init container along the lines of the sketch below:
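This is a sketch rather than a drop-in spec; in particular, the way the confirmation prompt of elasticsearch-node detach-cluster is answered is an assumption and should be verified against your Elasticsearch version:

```yaml
nodeSets:
  - name: data                      # your existing data nodeSet
    count: 12
    podTemplate:
      spec:
        initContainers:
          - name: detach-cluster
            # Image and the data volume mount should be inherited from the main
            # Elasticsearch container when not set explicitly (how ECK handles
            # init containers, as far as I know).
            command:
              - bash
              - -c
              - |
                # Rewrite this node's on-disk cluster state so it can leave the
                # lost cluster and join the newly bootstrapped one.
                # Assumption: piping "y" answers the confirmation prompt.
                echo y | bin/elasticsearch-node detach-cluster
```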
Make sure that the new pod specification with the additional init container is used in the corresponding StatefulSet. If not, wait for the operator to update the StatefulSet before you continue.
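One way to check (the StatefulSet name follows the <cluster-name>-es-<nodeSet-name> pattern and is hypothetical here):

```sh
kubectl get statefulset quickstart-es-data \
  -o jsonpath='{.spec.template.spec.initContainers[*].name}'
```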
Now start with one of the data nodes and delete the corresponding pod. The pod will then be recreated by the StatefulSet controller using the new spec.
Check the pod's logs and verify that it successfully joins the new cluster before proceeding to repeat the same procedure with the other data pods.
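For example (pod names are hypothetical):

```sh
# Delete one data pod; the StatefulSet controller recreates it with the new spec
kubectl delete pod quickstart-es-data-0

# Follow its logs and confirm it joins the new cluster before touching the next pod
kubectl logs -f quickstart-es-data-0
```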
Once all nodes have joined the new cluster, remove the init container again (important: otherwise the pods will detach themselves every time they restart) and also remove the cluster.initial_master_nodes setting.
I suggest going through the procedure on a test cluster first to verify that it works for your setup, before you apply it to the cluster that has lost its master nodes.
[EDIT] I have simplified the instructions a bit. The cluster.initial_master_nodes setting is not necessary on the data nodes.