Current status/recommendation for geo-redundant clustering

So, my work requires geo-redundancy for every solution. "Elasticsearch is not designed for that" isn't an acceptable answer, even though I understand that clusters spanning data centers actually open you up to more potential issues; I completely agree with the reasoning behind that. It looks like some people do it anyway: http://gibrown.com/2014/01/09/scaling-elasticsearch-part-1-overview/

However, we still need some kind of solution. Active/active is preferred, but a hot standby is the minimum requirement.

I see two ways to approach this: single-cluster and multi-cluster. Single-cluster means the data nodes in the secondary data center are part of the same cluster. Multi-cluster means there is some synchronization process to ensure the standby cluster has all of the data in the primary cluster. We do have a dedicated link with a huge pipe, but it is safe to assume the network will fail at some point. There are products like Couchbase that can automatically synchronize cluster data (http://blog.couchbase.com/announcing-release-couchbase-plug-elasticsearch), but that seems like quite a bit of technical debt to take on.

Has anyone had experience running a geo-redundant solution? Or any avenues for research?

Right now I am leaning toward snapshot/restore, because we'll need to be taking snapshots anyway. We just need a process to automatically restore the primary's snapshots onto the secondary cluster(s). Am I overlooking anything?
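To make that concrete, here's a minimal sketch of the snapshot side using the snapshot REST API over a shared repository. The host name, the `geo_backup` repository, and the snapshot naming scheme are all placeholders, not anything we've settled on:

```python
import time
import requests

PRIMARY = "http://primary-es:9200"   # placeholder endpoint
REPO = "geo_backup"                  # hypothetical repository registered on both clusters

def take_snapshot():
    """Create a timestamped snapshot on the primary and block until it completes."""
    name = "snap-%d" % int(time.time())
    resp = requests.put(
        "%s/_snapshot/%s/%s" % (PRIMARY, REPO, name),
        params={"wait_for_completion": "true"},
    )
    resp.raise_for_status()
    return name
```

Run that on a schedule on the primary; the secondary then just needs to notice new snapshots in the shared repository and restore them.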

More reading:

https://groups.google.com/forum/#!topic/elasticsearch/t6EISVSJP_g
https://crate.io/docs/en/latest/best_practice/multi_zone_setup.html

We don't recommend cross-DC clusters, as any network instability will cause problems.

The current options are:

  1. Snapshot and Restore
  2. Use your application layer to send writes to both clusters (a minimal sketch follows below)
  3. Similar to #2, but with a queuing mechanism that replicates to both clusters instead
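For #2, the shape is roughly this. It's just a sketch to illustrate the idea; the endpoints are made up, and in practice you'd want retries or queueing around the failure path:

```python
import requests

CLUSTERS = ["http://dc1-es:9200", "http://dc2-es:9200"]  # placeholder endpoints

def dual_write(path, document):
    """Option 2: the application sends the same index request to both clusters.

    `path` is the document path, e.g. "/myindex/mytype/123". Returns the
    clusters that rejected or missed the write so the caller can retry or
    reconcile them later.
    """
    failed = []
    for base in CLUSTERS:
        try:
            resp = requests.put(base + path, json=document, timeout=5)
            resp.raise_for_status()
        except requests.RequestException:
            failed.append(base)
    return failed
```

A write that lands on one side and not the other leaves the clusters out of sync until something reconciles them, which is what pushes people toward #3.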

One thing that is not clear to me about snapshots.

I know they are incremental.

Say I have 2 GB of data and take a snapshot; it'll be ~2 GB. Then I index another 100 MB of data and take a second snapshot, which should be about 100 MB.

If I delete snapshot #1 and then create another snapshot, is snapshot #3 going to be 2 GB? Or does snapshot #2 inherit the files associated with snapshot #1? The docs say this:

When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots.

So I am assuming that the 2 GB would be "used" by snapshot #2, because it was part of the cluster state at the time that snapshot was taken, and thus would not be deleted.

I'm trying to formulate a plan before I actually start testing this out. I'd like to keep the number of snapshots small by pruning old ones (because of "Snapshot deletion and creation slow down as number of snapshots in repository grows", elastic/elasticsearch#8958), but I'd also like to snapshot frequently in order to keep the secondary cluster as up-to-date as possible.
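Assuming the inheritance behaviour works the way I described, the pruning side of the plan could be as simple as something like this (a sketch only; the repository name and keep-count are placeholders):

```python
import requests

PRIMARY = "http://primary-es:9200"  # placeholder endpoint
REPO = "geo_backup"                 # hypothetical repository name
KEEP = 10                           # how many recent snapshots to retain

def prune_snapshots():
    """Delete all but the KEEP newest snapshots in the repository."""
    resp = requests.get("%s/_snapshot/%s/_all" % (PRIMARY, REPO))
    resp.raise_for_status()
    snapshots = sorted(resp.json()["snapshots"],
                       key=lambda s: s["start_time_in_millis"])
    for snap in snapshots[:-KEEP]:
        requests.delete(
            "%s/_snapshot/%s/%s" % (PRIMARY, REPO, snap["snapshot"])
        ).raise_for_status()
```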

Your assumption is correct.

So, I've got a POC for this, and it appears that when I do a restore, I first need to close any open indices that are being restored.

Is it safe to just do a /*/_close and let all of the shards re-allocate? Or should I only close the indices listed in the snapshot?

I am also concerned about how long a restore operation will take when there are several TB of data across dozens of indices...
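For reference, this is roughly what the POC does on the standby at the moment: it closes only the indices listed in the snapshot rather than doing a blanket /*/_close. Sketch only; the endpoint and repository name are placeholders:

```python
import requests

SECONDARY = "http://secondary-es:9200"  # placeholder endpoint
REPO = "geo_backup"                     # hypothetical shared repository

def restore_snapshot(name):
    """Close the indices contained in the snapshot, then restore it on the standby."""
    info = requests.get("%s/_snapshot/%s/%s" % (SECONDARY, REPO, name))
    info.raise_for_status()
    indices = info.json()["snapshots"][0]["indices"]

    for index in indices:
        # Closing an index that doesn't exist yet returns a 404; that's fine.
        resp = requests.post("%s/%s/_close" % (SECONDARY, index))
        if resp.status_code != 404:
            resp.raise_for_status()

    requests.post(
        "%s/_snapshot/%s/%s/_restore" % (SECONDARY, REPO, name),
        params={"wait_for_completion": "true"},
    ).raise_for_status()
```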

Yep, you cannot restore into an open index. As for closing everything, that is up to you; if the cluster is a standby then it makes sense, you just need to make sure you aren't hurt when things need to reallocate.

It might make sense to look at some of the other options I suggested.

My concerns with options 2 and 3 are guaranteed delivery and ensuring the two clusters stay in sync.

Is there a good solution for option 3 that you would recommend? I am not too familiar with this space. If I were designing it, I'd likely write a small app that works from some type of MQ (roughly like the sketch below), but there may already be solutions out there for this.
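To show what I mean: a consumer drains the queue and only acknowledges a message once every cluster has accepted it. The `poll_queue` and `ack` helpers stand in for whatever MQ client we'd end up using, and the endpoints are made up:

```python
import json
import requests

CLUSTERS = ["http://dc1-es:9200", "http://dc2-es:9200"]  # placeholder endpoints

def replicate(poll_queue, ack):
    """Option 3: apply queued writes to every cluster before acknowledging.

    `poll_queue` yields messages whose body contains the document path and
    source; `ack` marks a message as done. If a cluster is unreachable, the
    exception leaves the message unacknowledged so the queue redelivers it,
    which is safe because PUTs to an explicit document path are idempotent.
    """
    for message in poll_queue():
        doc = json.loads(message.body)
        for base in CLUSTERS:
            requests.put(base + doc["path"], json=doc["source"],
                         timeout=5).raise_for_status()
        ack(message)
```

In practice I'd probably give each cluster its own queue and consumer, so an outage on one side doesn't stall indexing on the other.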

I think something like this is coming down the pike as part of the open issues on async replication and the Changes API.

For now, we do #2 and have a watchdog script that checks for consistency between the clusters and can repair things when counts don't match. We don't tend to have network issues between datacenters, so we really only get out of sync once in a while. Having said that, I will be quite happy to ditch this application code workaround when this kind of replication is built into Elasticsearch.
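The consistency check doesn't have to be fancy; a per-index count comparison along these lines is the general idea (a sketch only, not our actual script; the endpoints are placeholders):

```python
import requests

CLUSTERS = {"dc1": "http://dc1-es:9200", "dc2": "http://dc2-es:9200"}  # placeholders

def find_mismatches(indices):
    """Compare per-index document counts across clusters and report mismatches."""
    mismatches = []
    for index in indices:
        counts = {}
        for name, base in CLUSTERS.items():
            resp = requests.get("%s/%s/_count" % (base, index))
            resp.raise_for_status()
            counts[name] = resp.json()["count"]
        if len(set(counts.values())) > 1:
            mismatches.append((index, counts))
    return mismatches
```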