So, my work requires georedundancy for every solution. The fact that things like Elasticsearch aren't designed for that isn't an acceptable answer, and in fact clusters that span datacenters open you up to even more potential issues. I completely understand and agree with the reasoning behind that. Looks like some people do it anyway: http://gibrown.com/2014/01/09/scaling-elasticsearch-part-1-overview/
However, we still need some kind of solution. Active/active is preferred, but a hot standby is the minimum requirement.
I see a couple of ways to approach this: single cluster and multi-cluster. Single cluster means the data nodes in the secondary data center are part of the same cluster. Multi-cluster means there is some synchronization process to make sure the standby cluster has all of the data in the primary cluster. We do have a dedicated link with a huge pipe, but it is safe to assume the network will fail at some point. There are things like Couchbase which can automatically synchronize cluster data (http://blog.couchbase.com/announcing-release-couchbase-plug-elasticsearch), but that seems like quite a bit of technical debt to take on.
Has anyone had experience with running a georedundant solution? Or any avenues for research?
Right now I am leaning toward snapshot/restore, because we'll need to be taking snapshots anyway. We just need a process to automatically restore the snapshots taken on the primary to the secondary cluster(s). Am I overlooking anything?
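To make that concrete, here is roughly the kind of process I'm picturing, using the Python client. This is only a sketch: the repository name ("backups"), hosts, and snapshot naming are all made up, and it assumes both clusters can see the same snapshot repository (shared fs or S3).

```python
from datetime import datetime
from elasticsearch import Elasticsearch

# Placeholder hosts for the two datacenters.
primary = Elasticsearch(["primary-dc.example.com:9200"])
standby = Elasticsearch(["standby-dc.example.com:9200"])

snap = "snap-" + datetime.utcnow().strftime("%Y%m%d-%H%M%S")

# Take the snapshot on the primary and block until it finishes.
primary.snapshot.create(repository="backups", snapshot=snap,
                        wait_for_completion=True)

# Restore it on the standby; the target indices there have to be
# closed or absent.
standby.snapshot.restore(repository="backups", snapshot=snap,
                         wait_for_completion=True)
```

Run on a schedule, that would keep the standby trailing the primary by one snapshot interval.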
Say I have 2gb of data, and take a snapshot. It'll be ~2gb. Then, I index 100mb of data. Take another snapshot, which should be about 100mb.
If I delete snapshot #1, then create another snapshot, is snapshot #3 going to be 2gb? Or does snapshot #2 inherit the files associated with snapshot #1? The docs say this:
When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots.
So I am assuming that means the 2gb would still be "used" by snapshot #2, because that data was part of the index at the time snapshot #2 was taken, and thus those files would not be deleted.
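If that's right, a simple retention job that prunes old snapshots should be safe, since files still referenced by newer snapshots are kept. A rough sketch with the Python client (repository name and retention count are made up):

```python
from elasticsearch import Elasticsearch

primary = Elasticsearch(["primary-dc.example.com:9200"])
KEEP = 7  # made-up retention count

# List all snapshots in the repository, oldest first.
snapshots = primary.snapshot.get(repository="backups",
                                 snapshot="_all")["snapshots"]
snapshots.sort(key=lambda s: s["start_time_in_millis"])

# Delete everything except the newest KEEP snapshots. Files that the
# remaining snapshots still reference stay in the repository.
for old in snapshots[:-KEEP]:
    primary.snapshot.delete(repository="backups", snapshot=old["snapshot"])
```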
Yep, you cannot restore into an open index. As for closing everything, that is up to you. If the cluster is a standby then it makes sense; you just need to check that you won't be hurt when shards need to reallocate.
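If you'd rather not close everything, you can close just the indices the snapshot contains before restoring. A sketch along the lines of your earlier one (same made-up repository name; the snapshot name is a placeholder):

```python
from elasticsearch import Elasticsearch

standby = Elasticsearch(["standby-dc.example.com:9200"])
snap = "snap-20150101-000000"  # whichever snapshot you're restoring

# Close only the indices the snapshot actually contains, then restore it.
info = standby.snapshot.get(repository="backups", snapshot=snap)
for index in info["snapshots"][0]["indices"]:
    standby.indices.close(index=index, ignore=404)  # 404 = not there yet

standby.snapshot.restore(repository="backups", snapshot=snap,
                         wait_for_completion=True)
```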
It might make sense to look at some of the other options I suggested.
My concerns with options 2 and 3 are guaranteed delivery and ensuring the two clusters stay in sync.
Is there a good solution for option 3 that you would recommend? I am not too familiar with this space. If I were designing it, I'd likely write a small app that works from some type of MQ, but there may already be solutions out there for this.
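To make it concrete, what I have in mind is something like the following. It's purely hypothetical: RabbitMQ via pika, a made-up queue name and message format, and a consumer that writes to both clusters and only acks once both accept the write.

```python
import json

import pika
from elasticsearch import Elasticsearch

primary = Elasticsearch(["primary-dc.example.com:9200"])
standby = Elasticsearch(["standby-dc.example.com:9200"])

def handle(channel, method, properties, body):
    doc = json.loads(body)
    for cluster in (primary, standby):
        cluster.index(index=doc["index"], doc_type=doc["type"],
                      id=doc["id"], body=doc["source"])
    # Ack only after both clusters accepted the write, so a failed
    # write gets redelivered instead of silently lost.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("mq.example.com"))
channel = connection.channel()
channel.queue_declare(queue="es-writes", durable=True)
channel.basic_consume(queue="es-writes", on_message_callback=handle)
channel.start_consuming()
```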
I think something like this is coming down the pike as part of these issues on async replication and the Changes API.
For now, we do #2 and have a watchdog script that checks for consistency between the clusters and can repair things when counts don't match. We don't tend to have network issues between datacenters, so we really only get out of sync once in a while. Having said that, I will be quite happy to ditch this application code workaround when this kind of replication is built into Elasticsearch.
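Conceptually the watchdog boils down to something like this; just a sketch of the idea, not our actual script, with placeholder hosts and the repair step left out:

```python
from elasticsearch import Elasticsearch

primary = Elasticsearch(["primary-dc.example.com:9200"])
standby = Elasticsearch(["standby-dc.example.com:9200"])

# Compare per-index document counts between the clusters and flag drift.
for index in primary.indices.get(index="*"):
    p = primary.count(index=index)["count"]
    s = (standby.count(index=index)["count"]
         if standby.indices.exists(index=index) else 0)
    if p != s:
        print("out of sync: %s primary=%d standby=%d" % (index, p, s))
        # repair step goes here (re-index or re-copy the affected index)
```

In practice you also have to allow for in-flight indexing before declaring something out of sync.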