I am trying to come up with a disaster recovery configuration for our Elastic Stack.
We have two data centers (DC1 and DC2) in separate cities, and both will be up and running at the same time. If a problem occurs that prevents DC1 from serving, we want DC2 to handle all the traffic. To do that, we have to find a way to keep these two Elasticsearch clusters in sync.
I've found Cross-Cluster Replication (CCR); however, it does not allow active/active synchronization. It just copies indices from one cluster to another and does not allow update operations on the replicated side.
I came across some paid solutions like Opster, but I would like to know if there's any other way to set this up.
Active-active replication of a single index is a very hard problem to solve, and I am not aware of any solution that offers this for Elasticsearch. One way to get around this might be to write to both clusters in parallel, e.g. by doing this in the client, or by serializing updates and inserts onto a message queue and then applying them to each cluster separately in the correct order. This is, however, not easy either, as it can be tricky to guarantee ordering (consistency) and to handle temporary cluster outages or connectivity issues.
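To make the message-queue idea a bit more concrete, here is a minimal in-memory sketch in Python. Everything in it is hypothetical: `Update`, `FakeCluster` and `Replicator` are illustration names, not part of any Elasticsearch client, and the list `log` stands in for a durable, ordered queue. It assumes the producer assigns a strictly increasing version per document, similar in spirit to Elasticsearch's external versioning, so that replaying an entry is harmless.

```python
from dataclasses import dataclass, field


@dataclass
class Update:
    doc_id: str
    version: int  # assumed: strictly increasing per document, set by the producer
    body: dict


@dataclass
class FakeCluster:
    """In-memory stand-in for one cluster; NOT an Elasticsearch client."""
    name: str
    available: bool = True
    docs: dict = field(default_factory=dict)  # doc_id -> (version, body)

    def index(self, update: Update) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        current = self.docs.get(update.doc_id)
        # Accept only strictly newer versions, so replays and stale writes are no-ops.
        if current is None or update.version > current[0]:
            self.docs[update.doc_id] = (update.version, update.body)


class Replicator:
    """Applies an ordered log to one cluster and remembers how far it got."""

    def __init__(self, cluster: FakeCluster) -> None:
        self.cluster = cluster
        self.offset = 0  # position of the next log entry to apply

    def catch_up(self, log: list) -> int:
        while self.offset < len(log):
            try:
                self.cluster.index(log[self.offset])
            except ConnectionError:
                break  # keep the offset; retry from the same entry later
            self.offset += 1
        return self.offset


def demo():
    log = [
        Update("a", 1, {"v": "first"}),
        Update("a", 2, {"v": "second"}),
        Update("b", 1, {"v": "only"}),
    ]
    dc1, dc2 = FakeCluster("DC1"), FakeCluster("DC2")
    r1, r2 = Replicator(dc1), Replicator(dc2)

    dc2.available = False  # simulate a temporary outage of DC2
    r1.catch_up(log)
    r2.catch_up(log)       # makes no progress while DC2 is down

    dc2.available = True   # DC2 comes back and replays from its saved offset
    r2.catch_up(log)
    return dc1, dc2
```

Each cluster has its own offset into the same log, so a cluster that was unreachable simply resumes where it stopped. The hard parts mentioned above are exactly what this sketch leaves out: a durable queue, a single agreed order across producers, and reads that may see different data on the two clusters while one is behind.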
I would instead recommend using CCR in an active-passive approach and always writing to one of the clusters.
So in that case, if the DC hosting the leader index goes down, the only option I have to continue working is to stop the CCR and change the follower to a normal index, right? If that's the case, this is a lot of manual work, and it doesn't help with disaster recovery for someone whose Elasticsearch usage is both read- and write-intensive.
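For reference, turning a follower into a regular index comes down to four REST calls against the follower cluster: pause following, close the index, unfollow, and reopen it. The Python sketch below only builds that request sequence so it could be scripted; the function name `promote_follower_requests` and the index name in the usage note are hypothetical, and actually sending the requests (authentication, error handling, redirecting clients to DC2) is left out.

```python
from urllib.parse import quote


def promote_follower_requests(follower_index: str) -> list:
    """Return the (method, path) pairs that convert a CCR follower index
    into a regular, writable index. Nothing is sent; this only builds the list."""
    idx = quote(follower_index, safe="")
    return [
        ("POST", f"/{idx}/_ccr/pause_follow"),  # stop pulling changes from the leader
        ("POST", f"/{idx}/_close"),             # unfollow requires a closed index
        ("POST", f"/{idx}/_ccr/unfollow"),      # drop the follower metadata
        ("POST", f"/{idx}/_open"),              # reopen as a normal index
    ]
```

Calling `promote_follower_requests("my-follower")` yields the four paths in order, which could be fed to whatever HTTP client is already in use. Note that unfollowing is one-way: once DC1 is back, replication in the opposite direction has to be set up again from scratch.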
There is no perfect solution, and IMHO there will always be some tradeoff with respect to consistency and/or availability. Assuming complete DC failure is a rare occurrence, I would probably sacrifice a bit of availability in favour of maintaining consistency, since the cost is only paid during a DC failure and does not affect normal operations. You will, however, need to determine which tradeoffs you can and cannot accept.
There may, however, be another option. If the DCs are close and have good connectivity and low latency, you might be able to deploy a single cluster stretched across both. This might resolve the problem, but the increased latencies could naturally affect performance, and it may be less reliable. You would also need a third site/DC to hold nodes, or at least a tiebreaker.
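The reason for the third site is majority voting among master-eligible nodes. The Python sketch below is a deliberately simplified model of that rule, not how Elasticsearch computes it internally: it ignores automatic adjustments of the voting configuration, and the function name and site labels are made up for illustration.

```python
def survives_site_loss(voters_by_site: dict, failed_site: str) -> bool:
    """True if the remaining master-eligible nodes still form a strict majority
    of all voters after one whole site is lost (simplified model)."""
    total = sum(voters_by_site.values())
    remaining = total - voters_by_site[failed_site]
    return remaining > total // 2


# Two sites only: whichever site holds the majority is a single point of failure.
two_sites = {"dc1": 2, "dc2": 1}

# Two sites plus a tiebreaker at a third site: any single site can be lost.
with_tiebreaker = {"dc1": 1, "dc2": 1, "tiebreaker": 1}
```

In this model, `survives_site_loss(two_sites, "dc1")` is false while every single-site loss in `with_tiebreaker` leaves a majority, which is why two DCs alone cannot give automatic failover in both directions.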