Best Practice for Keeping Separate DC and DR Elasticsearch Open-Source Clusters in Sync

Hi Team,

We have two separate Elasticsearch Open-Source clusters running on Linux servers:

DC Cluster
  • 6 nodes
  • 3 Dedicated Master nodes
  • 3 Data nodes
  • Active production cluster

DR Cluster
  • 6 nodes
  • 3 Dedicated Master nodes
  • 3 Data nodes
  • Separate cluster located in the DR site

Currently, only the DC cluster is active. During DR drill activities, we stop the DC cluster and bring up the DR cluster. We are now looking for a better architecture where both clusters remain synchronized so that the DR cluster is always ready for failover with minimal data loss.

Our objective is to maintain both Elasticsearch Open-Source clusters in sync continuously and have a seamless switchover during DR drills or in the event of a disaster.

Questions
  • Is it recommended to keep both DC and DR clusters running simultaneously in an active-passive setup?
  • What is the recommended approach to keep two independent Elasticsearch Open-Source clusters synchronized in near real-time?
  • Has anyone implemented a similar DC-DR architecture with Elasticsearch Open Source? If yes, could you please share your architecture and operational experience?
  • Are there any best practices or reference architectures for maintaining synchronized DC and DR Elasticsearch Open-Source clusters?

Environment
  • Elasticsearch Open Source
  • Linux servers at both DC and DR sites
  • Separate clusters
  • DC: 3 Dedicated Master nodes + 3 Data nodes
  • DR: 3 Dedicated Master nodes + 3 Data nodes
  • Requirement: Near real-time synchronization and seamless DR failover

Any suggestions, best practices, or references from the community would be greatly appreciated.

Thank you.

Not trying to make a plug, but with a Platinum license Elasticsearch will do this for you with cross-cluster replication. I say this mostly because this is a hard problem to get right, particularly at large scale, and it is often one of those things you only really know works when it all goes wrong. I've had my share of middle-of-the-night cluster recoveries myself, so I mention this first.

If you want to do it self-managed without a license, then you can probably take pointers from that guide too. But here are my 2 cents, and other folks are welcome to chime in.

As with all good things in life, it will depend on your tolerance for data loss and downtime. There are 2 or 3 things I'd immediately think about:

  • If you want a true active-passive setup with no loss and minimal downtime, then yes, I would run both simultaneously, with the passive cluster following the active one, similar to the setup you've described. You will want a service that writes to the passive in duplicate and does not return until both clusters have been written to. There can be gotchas here in making sure both clusters stay in sync, say when the service that's writing goes down unexpectedly, but they are workable, for example with an async audit mechanism.
  • On the other side of that coin, you can spend as little as possible while still being ready for disaster. You could not run a passive cluster at all and be prepared to launch it in a separate region on demand. This likely assumes you are taking regular backups to some kind of blob storage like S3. Then, on demand, you'd launch the cluster, restore the backups, and restore service. Likely there will be some data loss since your last backup, which means you'd replay the lost data when convenient (and that you'd still need an audit somewhere of the last successful write and last successful backup time). For small indices and clusters this can be very fast. For larger clusters with lots of data to restore it is likely unpalatable.
  • Anything in between is possible and entirely depends on your requirements for recovery.
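
To make the first option a bit more concrete, here is a minimal Python sketch of a dual-write service with an audit list. It is only an illustration under assumptions: the `DualWriter` class, the `Writer` callable signature, and the in-memory `audit` list are all hypothetical names, and in a real service each writer would wrap an HTTP call to one cluster.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical writer signature: (index, doc_id, document) -> None, raising on failure.
# In practice each writer would wrap an HTTP call to one cluster.
Writer = Callable[[str, str, dict], None]


@dataclass
class DualWriter:
    active: Writer    # writes to the DC (active) cluster
    passive: Writer   # writes to the DR (passive) cluster
    audit: List[dict] = field(default_factory=list)  # passive writes still owed

    def write(self, index: str, doc_id: str, doc: dict) -> bool:
        """Return True only when both clusters accepted the document."""
        # A failure on the active cluster propagates to the caller unchanged.
        self.active(index, doc_id, doc)
        try:
            self.passive(index, doc_id, doc)
        except Exception as exc:
            # Record what the passive cluster missed so it can be repaired later.
            self.audit.append(
                {"index": index, "id": doc_id, "doc": doc, "error": str(exc)}
            )
            return False
        return True

    def replay(self) -> int:
        """Retry owed passive writes; return how many succeeded."""
        owed, self.audit = self.audit, []
        repaired = 0
        for entry in owed:
            try:
                self.passive(entry["index"], entry["id"], entry["doc"])
                repaired += 1
            except Exception as exc:
                # Still failing: keep the entry for the next replay.
                entry["error"] = str(exc)
                self.audit.append(entry)
        return repaired
```

The in-memory list is only for illustration. To survive a crash of the writing service itself, the audit would need to live somewhere durable, and replaying old entries after newer writes to the same document would need some ordering safeguard.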

Some of these things get easier in the near future too, with things like the stateless architecture for Elasticsearch (it's what backs our Serverless offering, so think k8s launch-on-demand recovery where ES data lives by default in S3, so no data loss and minimal downtime), but that's not a right-now thing, and it's not clear what your timeline is.

Happy to bat ideas around with you or dig into something specific.

The easiest way is with Cross-Cluster Replication, but this requires a license: you would need an Enterprise license (Platinum licenses were retired, according to the subscription page).

Without it the 2 main approaches would be:

  • Dual ingestion: ingest the same data into both clusters at the same time.
  • Use Snapshot and Restore on demand.

Both will depend on which data you have, how you currently ingest it, and your requirements regarding downtime etc.
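
For the Snapshot and Restore route, a rough Python sketch of the two REST calls involved might look like the following. The hostnames, the repository name `dr_backups`, and the helper function names are assumptions for illustration. It also assumes a snapshot repository of that name is already registered on both clusters and points at the same shared storage (for example an S3 bucket).

```python
import json
import urllib.request

DC_URL = "http://dc-es.example.internal:9200"  # hypothetical DC endpoint
DR_URL = "http://dr-es.example.internal:9200"  # hypothetical DR endpoint
REPO = "dr_backups"  # hypothetical repository, assumed registered on both clusters


def _request(method, url, body):
    # Build (but do not send) a JSON request against the Elasticsearch REST API.
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )


def snapshot_request(name, indices):
    # PUT /_snapshot/<repo>/<snapshot> creates a snapshot on the DC cluster.
    url = f"{DC_URL}/_snapshot/{REPO}/{name}?wait_for_completion=true"
    return _request("PUT", url, {"indices": indices, "include_global_state": False})


def restore_request(name, indices):
    # POST /_snapshot/<repo>/<snapshot>/_restore restores it on the DR cluster.
    url = f"{DR_URL}/_snapshot/{REPO}/{name}/_restore?wait_for_completion=true"
    return _request("POST", url, {"indices": indices, "include_global_state": False})


def send(req):
    # Send a prepared request and return the parsed JSON response.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)
```

During a drill this would be used roughly as `send(snapshot_request("snap-1", "logs-*"))` on a schedule at the DC, then `send(restore_request("snap-1", "logs-*"))` at the DR site, where `snap-1` and `logs-*` are placeholder names. A restore fails if an open index with the same name already exists on the target, so the DR cluster would need those indices closed or deleted first.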