I'm currently working on HA tests for ECE before going to production and I just ran into an interesting issue.
I have 4 nodes divided into 2 zones, and the ECE system Elasticsearch cluster has 2 nodes (one node per zone). I simulated a node failure by deleting all Docker containers on one of those two nodes, and I can't recover from it even when I install ECE again on the failed node using an emergency token.
I can log into the UI, but it's broken and I can't do anything there. I understand that the system cluster failed because with only 2 zones it had 2 masters, and when one of them failed, the whole cluster went down. This was expected. I wonder if there's a way to recover from such a state, because I believe this edge case could happen in prod as well and I want to be prepared for it.
I realize that having only two zones is risky, but theoretically this can happen even with 3 zones in prod: one zone is in maintenance and a second one fails.
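To make the two cases concrete, here is a minimal sketch of the majority arithmetic as I understand it. It assumes the system cluster needs a strict majority of its master-eligible nodes to elect a master; the function names are hypothetical and this is not ECE code.

```python
def quorum(master_eligible: int) -> int:
    """Smallest strict majority of the master-eligible nodes (assumed rule)."""
    return master_eligible // 2 + 1


def has_quorum(master_eligible: int, unavailable: int) -> bool:
    """True if the nodes that are still running can form a majority."""
    return master_eligible - unavailable >= quorum(master_eligible)


# 2 zones, one master-eligible node each: losing one node loses the majority.
print(has_quorum(2, 1))
# 3 zones: a single failed zone is tolerated.
print(has_quorum(3, 1))
# 3 zones, one in maintenance and one failed: the majority is lost again.
print(has_quorum(3, 2))
```

Under that assumption, 3 zones only survive one unavailable zone at a time, which is why the maintenance-plus-failure case worries me.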
My question is: should the ECE system cluster be considered a single point of failure that we have to make sure never fails, or is there a way to recover even from this state?
Thanks in advance.