ECE broken after one node lost


I'm currently working on HA tests for ECE before going to production and I just ran into an interesting issue.

I have 4 nodes divided into 2 zones, and the ECE Elasticsearch cluster (the system one) has 2 nodes (one node per zone). I simulated a node failure by deleting all Docker containers on one of those two nodes, and I can't recover from it even after reinstalling ECE on the failed node using an emergency token.

I can log into the UI, but it's broken and I can't do anything there. I understand that the system cluster failed because, with only 2 zones, it had 2 master-eligible nodes, and when one of them failed the cluster lost its quorum and went down. This was expected. I wonder whether there's a way to recover from such a state, because I believe this edge case could happen in production as well and I want to be prepared for it.

I realize that having only two zones is risky, but in theory this can happen even with 3 zones in production: one zone is in maintenance, and a second one just failed.

My question is: should the ECE system cluster be considered a single point of failure that we have to make sure never fails, or is there a way to recover even from this state?

Thanks in advance

The executive summary is that to have supportable HA in ECE, you need more than 2 zones.

If you only have 2 physical zones, you can create a "banana zone" (basically a logical zone that only accepts tiebreaker instances), which at least provides resiliency against allocator outages. A full zone outage, however, should be assumed to always result in loss of availability (though not of data).

Currently (and for the near future), the ECE system cluster only backs the UI (and search requests in the API), so you can always recover it through the API. If it's down, you have to `_shutdown` and then `_restart` it.
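For example, driving those two calls through the ECE administration API might look roughly like this. This is a sketch, not a verified recipe: the coordinator host, port, credentials, and cluster ID are placeholders, and the exact endpoint paths can vary between ECE versions, so check the API reference for your release.

```shell
# Placeholder values - substitute your coordinator host, admin credentials,
# and the ID of the system (admin-console-elasticsearch) cluster.
ECE_HOST="https://coordinator.example.com:12443"
CLUSTER_ID="<system-cluster-id>"

# Shut the cluster down first...
curl -k -u admin:PASSWORD -X POST \
  "$ECE_HOST/api/v1/clusters/elasticsearch/$CLUSTER_ID/_shutdown"

# ...then restart it once the shutdown has completed.
curl -k -u admin:PASSWORD -X POST \
  "$ECE_HOST/api/v1/clusters/elasticsearch/$CLUSTER_ID/_restart"
```

`-k` skips TLS verification, which is only appropriate if your installation still uses the self-signed certificates generated at install time.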

The biggest issue is that our persistence store (ZooKeeper) requires more than 2 zones to be HA. When it loses quorum, it requires slightly messy manual intervention to fix, because the API isn't functional while ZooKeeper is down.
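The arithmetic behind that requirement: a ZooKeeper ensemble of n nodes needs a strict majority, floor(n/2) + 1, to keep serving, so a 2-node (2-zone) ensemble tolerates zero failures, which is exactly the state described above. A quick illustration:

```shell
# Majority quorum for a ZooKeeper ensemble of n nodes is floor(n/2) + 1;
# the ensemble can therefore lose n - quorum nodes and stay available.
for n in 2 3 5; do
  echo "nodes=$n quorum=$(( n / 2 + 1 )) tolerated_failures=$(( n - (n / 2 + 1) ))"
done
# nodes=2 quorum=2 tolerated_failures=0
# nodes=3 quorum=2 tolerated_failures=1
# nodes=5 quorum=3 tolerated_failures=2
```

This is why 2 zones buy you no ZooKeeper resiliency over 1: losing either zone takes quorum with it.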

(Also, the UI doesn't support building HA clusters in 2-zone mode; you have to use the API or hand-wire the replica separation logic.)

(Finally, there is an "alpha" disaster recovery process that recovers from all zones going down, which we are sharing on request while we work on improving it. It could probably be adjusted to handle 1 of 2 zones going down, but it's not recommended.)

