ECE Disaster Recovery: how it works


(Tim Arp) #1

Hi,
I'm working on creating the infrastructure for ECE at AWS on EC2.
I have 3 different Auto Scaling Groups (Coordinators, Allocators, Proxies). EC2 instances come and go, and I think I have this case handled. In my current design the only group that has scale up/down policies is the proxy group. All ASGs have a minimum of 3 servers (1 in each AZ).

What happens if someone accidentally runs my script and destroys the setup? Besides firing the individual.
I don't see where anything for ECE is persisted. Should I be running a snapshot on EC2?

I get that the cluster's data would be in snapshots in S3. But how do I rebuild the whole ECE environment like it was?

Thanks,
Tim


(Alex Piggott) #2

Hi @tarp

We don't have a great story around disaster recovery at the moment - it's an item on our roadmap to improve.

There are three parts to "rebuilding the environment":

  • All the servers with the right preparations (XFS etc.) - we recommend creating AMIs or similar so you can recreate these easily
  • All the runners, roles, etc. ... this can be automated using the --roles / --runner-id etc. options when installing
    • (there are currently just a couple of things, like the root CNAME and the license, that aren't covered by this)
  • All the clusters ... you covered the cluster data; that leaves the cluster metadata - see below
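For reference, the automated reinstall described in the second bullet might look roughly like this. The hostname, token, zone, and runner id values are placeholders, and the exact flag set depends on your ECE version's installer docs:

```sh
# First host: fresh ECE install (becomes the initial director/coordinator).
bash elastic-cloud-enterprise.sh install \
  --availability-zone us-east-1a

# Additional hosts: join the existing installation with fixed roles and a
# fixed runner id, so the rebuilt environment matches the old topology.
# COORDINATOR_HOST and ROLES_TOKEN come from the output of the first install.
bash elastic-cloud-enterprise.sh install \
  --coordinator-host "$COORDINATOR_HOST" \
  --roles-token "$ROLES_TOKEN" \
  --roles "allocator" \
  --runner-id "allocator-1a-01" \
  --availability-zone us-east-1a
```

Pinning `--runner-id` per host is what lets a scripted rebuild reproduce the same runner layout, rather than getting freshly generated ids.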

If you take snapshots of the data on the ZooKeeper servers then, with help from deep support, it would be possible to get a system up and running again with the same cluster metadata (cluster IDs etc.); you could then reapply the plan together with a "restore from snapshot". It certainly wouldn't be a fun few hours...

Alex


(Tim Arp) #3

Hi Alex,
I have the build-out 100% done with Terraform. So it's the configuration I'm concerned about. I would like to talk more about this. Our support contract is close to being established but not yet. I would like to get some more detail about a process we could follow for DR. Can you elaborate on the "snapshots of zookeeper data"?

Can you clarify something for me? I see that the ZooKeepers run only on the coordinators/directors. If all coordinator/director servers are down, is ECE down and in a DR situation?

thanks,
Tim


(Alex Piggott) #4

Can you elaborate on the "snapshots of zookeeper data"?

ZooKeeper maintains a snapshot of its in-memory data on disk. So if you put the master node's snapshot (basically /mnt/data/elastic/:allocator/services/zookeeper/data) somewhere safe, then you can restore it, with some caveats (they are discussed here; the TL;DR is that while it's not mathematically sound, it's good enough for DR purposes).
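A backup of that data directory could be as simple as the sketch below. The paths and the S3 bucket name are illustrative (on a real director you'd point ZK_DATA at the path above); the sketch uses a stand-in directory so it runs anywhere:

```shell
# Sketch: archive the ZooKeeper data directory and keep a timestamped copy.
# ZK_DATA and BACKUP_DIR defaults are stand-ins, not the real ECE paths.
ZK_DATA="${ZK_DATA:-/tmp/zk-demo/data}"
BACKUP_DIR="${BACKUP_DIR:-/tmp/zk-backups}"

mkdir -p "$ZK_DATA" "$BACKUP_DIR"
: > "$ZK_DATA/snapshot.0"   # stand-in for a real ZK snapshot file

STAMP="$(date +%Y%m%d-%H%M%S)"
tar -czf "$BACKUP_DIR/zk-data-$STAMP.tar.gz" \
    -C "$(dirname "$ZK_DATA")" "$(basename "$ZK_DATA")"

# Then ship it off-host, e.g. (bucket name hypothetical):
#   aws s3 cp "$BACKUP_DIR/zk-data-$STAMP.tar.gz" s3://my-ece-dr-bucket/
ls "$BACKUP_DIR"
```

On a real host you would run this on the current ZooKeeper master and on a schedule, since the snapshot only captures state as of the last backup.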

Can you clarify something for me? I see that the ZooKeepers run only on the coordinators/directors. If all coordinator/director servers are down, is ECE down and in a DR situation?

The proxy in ECE will continue to run for some period after the directors go down (we have an issue to make that period indefinite, which is how we run it in our SaaS, via a config change), and the clusters will run forever (though of course their state cannot be changed).

If the coordinators go down, then the system state cannot be changed, but the existing clusters will continue running indefinitely.

Directors being down/deleted is the worst case, since you can always bring a new coordinator up easily (provided you safely store the secrets file generated as part of the original install). Fixing downed directors is always a slightly manual process.


(system) #5

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.