ECE broken after one node lost

osykora · February 7, 2019, 4:05pm

Hi,

I'm currently working on HA tests for ECE before going to production and I just ran into an interesting issue.

I have 4 nodes divided into 2 zones, and the ECE Elasticsearch cluster (the system one) has 2 nodes (one node per zone). I simulated a node failure by deleting all Docker containers on one of those two nodes, and I can't recover from it even when I install ECE again on the failed node using an emergency token.

I can log into the UI, but it's broken and I can't do anything there. I understand that the system cluster failed because with only 2 zones it had 2 masters, and when one of them failed, the whole cluster went down. This was expected. I wonder if there's a way to recover from such a state, because I believe this edge case could happen in prod as well and I want to be prepared for it.

I realize that having only two zones is risky, but theoretically this can happen even with 3 zones in prod - one zone is in maintenance and a second one fails.

My question is - should the ECE system cluster be considered a single point of failure that we have to make sure never fails, or is there a way to recover even from this state?

Thanks in advance

Alex_Piggott · February 7, 2019, 5:37pm

The executive summary is that to have supportable HA in ECE we require >2 zones.

If you only have 2 physical zones, you can create a "banana zone" (basically a logical zone which only accepts tiebreaker instances). This at least provides resiliency to allocator outages, though a full zone outage should be assumed to always result in loss of availability (though not of data).

Currently (and for the near future), the ECE system cluster only backs the UI (and search requests in the API), so you can always recover it from the API. If it's down, you have to _shutdown and then _restart it.
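As a rough illustration of the _shutdown / _restart step, here is a minimal Python sketch that only builds the two POST requests. The URL path (`/api/v1/clusters/elasticsearch/{cluster_id}/...`) is an assumption based on the ECE v1 REST API, and the coordinator address, cluster ID, and auth header are hypothetical placeholders; check the API reference for your ECE version before relying on any of it.

```python
import urllib.request

# Assumed path layout for the ECE v1 cluster endpoints (not confirmed in this thread).
CLUSTER_PATH = "/api/v1/clusters/elasticsearch/{cluster_id}/{action}"


def build_cluster_action_request(base_url, cluster_id, action, auth_header):
    """Build (but do not send) a POST request for _shutdown or _restart."""
    if action not in ("_shutdown", "_restart"):
        raise ValueError(f"unsupported action: {action}")
    url = base_url.rstrip("/") + CLUSTER_PATH.format(
        cluster_id=cluster_id, action=action
    )
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": auth_header}
    )


if __name__ == "__main__":
    # Hypothetical values: replace with your coordinator host, the system
    # cluster's ID, and whatever credentials your installation uses.
    base_url = "https://ece-coordinator.example:12443"
    cluster_id = "SYSTEM_CLUSTER_ID"
    auth = "Bearer YOUR_TOKEN"

    for action in ("_shutdown", "_restart"):
        req = build_cluster_action_request(base_url, cluster_id, action, auth)
        print(req.get_method(), req.full_url)
        # To actually send it: urllib.request.urlopen(req)
```

The requests are only printed here; sending them is left commented out so that nothing is shut down by accident.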

The biggest issue is that our persistence store (Zookeeper) requires >2 zones to be HA. With fewer, it requires slightly messy manual intervention to fix (because the API isn't functional when ZK is down).
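The ">2 zones" requirement follows from majority-quorum arithmetic. A small Python sketch, assuming one voting member per zone (the function names are illustrative, not part of ECE or Zookeeper):

```python
def quorum_size(voters: int) -> int:
    """Smallest strict majority of `voters` voting members."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voters - quorum_size(voters)


if __name__ == "__main__":
    for zones in (2, 3):
        print(
            f"{zones} zones: quorum={quorum_size(zones)}, "
            f"can lose {tolerated_failures(zones)}"
        )
```

With 2 voters the quorum is 2, so losing either one loses the majority. With 3 voters the quorum is still 2, so one zone can fail. This is also why a tiebreaker-only third zone helps: it adds a third vote without hosting data.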

(Also, the UI doesn't support building HA clusters in 2-zone mode; you have to use the API or hand-wire the replica separation logic.)

(Finally - there is an "alpha" disaster recovery process, which recovers from all zones going down, that we are sharing on request while we work on improving it. It could probably be adjusted to handle 1 of 2 zones going down, but that's not recommended.)

Alex

system · February 21, 2019, 5:37pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
