Elasticsearch migration and rollback strategy (snapshot and restore)

We are planning to migrate the data of an Elasticsearch cluster from AWS to a new cluster set up in GCP (both clusters are self-managed). The steps are as follows:

  1. Take a full snapshot of the AWS ES cluster.
  2. Restore the full snapshot on the GCP ES cluster.
  3. Take an incremental snapshot on the AWS ES cluster.
  4. Restore the incremental snapshot on the GCP ES cluster.
  5. Take a full backup on GCP after the incremental restore is done.
  6. Point the application to the GCP ES cluster.
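For reference, the steps above would map onto the snapshot and restore APIs roughly as sketched below. The repository and snapshot names (`migration_repo`, `snapshot_1`, and so on) are hypothetical, and the repository type and settings depend on your shared storage:

```
# Register a shared repository on the AWS cluster (settings are illustrative)
PUT _snapshot/migration_repo
{
  "type": "s3",
  "settings": { "bucket": "my-migration-bucket" }
}

# Step 1: full snapshot on AWS
PUT _snapshot/migration_repo/snapshot_1?wait_for_completion=true

# Step 2: register the same repository on GCP, then restore
POST _snapshot/migration_repo/snapshot_1/_restore

# Steps 3-4: take a second snapshot on AWS and restore it on GCP
# (the restore target indices must be closed or deleted on GCP first)
PUT _snapshot/migration_repo/snapshot_2?wait_for_completion=true
POST _snapshot/migration_repo/snapshot_2/_restore
```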

If things don't work as expected on the destination (GCP), our rollback strategy is:
A. Take an incremental snapshot on GCP based on the full snapshot (step 5).
B. Apply the GCP incremental snapshot on the AWS cluster.
C. Point the application back to the AWS ES cluster.
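In API terms, the rollback steps would look roughly like the sketch below (all names hypothetical). Note that a repository GCP is actively writing snapshots to should be registered as read-only on AWS, and the indices being restored must be closed or deleted on AWS before the restore:

```
# On AWS: register the repository that GCP writes snapshots to, read-only
PUT _snapshot/rollback_repo
{
  "type": "gcs",
  "settings": { "bucket": "my-rollback-bucket", "readonly": true }
}

# Close the indices that the restore will overwrite
POST my-index/_close

# Restore the latest GCP snapshot onto AWS
POST _snapshot/rollback_repo/snapshot_gcp_delta/_restore?wait_for_completion=true
```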

Does this strategy work? If not, why won't it work?

Thank you.

All snapshots are full snapshots, not incremental, and contain all data indices in the cluster unless the indices to snapshot are explicitly defined.

What are you hoping to achieve by taking 2 snapshots? If it is to reduce the time it takes to create the second snapshot, that part will work, as Elasticsearch will reuse segments that have not changed since previous snapshots.

The writes will be going to the destination cluster (GCP) after the initial cutover. We need the delta data from GCP to be applied on AWS for rollback. So, during the rollback, we are planning to take an incremental snapshot on GCP and apply it on AWS to reduce the downtime when we point our app back to the source cluster (AWS).

As Elasticsearch does not support incremental snapshots the way you describe, both snapshots you take will be full snapshots. Your approach will therefore likely need to change.

Are you indexing into all indices or are you e.g. using rollover, so that a lot of the indices are read-only?

There are incremental snapshots. Please refer to:

No, that is incorrect.

Every snapshot is a full snapshot, and you restore complete indices as they existed at the time the snapshot in question was taken. If you look in the docs, there is no option to selectively restore parts of an index based on some criteria. The incremental aspect is that segments (which are immutable) that have already been snapshotted in previous snapshots are not copied every time but rather reused. Once you have taken a first snapshot, only new segment files are added, so the repository can grow more slowly if not many new segments are being created.
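To make this concrete, here is a toy model (not actual Elasticsearch code, just an illustration of the principle) of a repository that stores immutable segments. Each snapshot uploads only segments the repository has not seen, yet records the full list of segments needed to restore on its own:

```python
# Toy model of snapshot-level "incrementality" via immutable segment reuse.
repository = {}          # segment name -> segment data already uploaded
snapshot_manifests = {}  # snapshot name -> full segment list it references

def take_snapshot(name, segments):
    """Upload only segments missing from the repository, but record the
    complete segment list so the snapshot is restorable on its own."""
    uploaded = [s for s in segments if s not in repository]
    for s in uploaded:
        repository[s] = f"data-of-{s}"
    snapshot_manifests[name] = list(segments)
    return uploaded

# First snapshot: the index consists of three segments, all get uploaded.
take_snapshot("snapshot1", ["seg_a", "seg_b", "seg_c"])

# Later: seg_a and seg_b are unchanged, seg_c was merged away into seg_d,
# and newly indexed documents produced seg_e.
newly_uploaded = take_snapshot("snapshot2", ["seg_a", "seg_b", "seg_d", "seg_e"])

print(newly_uploaded)                   # only the segments not yet in the repo
print(snapshot_manifests["snapshot2"])  # but the manifest is still complete
```

The second snapshot copies little data, yet its manifest references every segment it needs, which is why it can be restored without first restoring snapshot1.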

Did you look at the very old blog post I linked to in the topic you mentioned? The implementation of the snapshot and restore process has improved over the years and been made more resilient, but most of the fundamental principles are largely still the same.

I have gone through the document in the link you posted. Thank you for the reference. I will no longer refer to it as an incremental snapshot.

I have the following question for expanding my knowledge:

I have two snapshots taken on a cluster X 1 hour apart. Let's call them snapshot1 and snapshot2. The total size of all indices was 100 GB when snapshot1 was taken and 105 GB when snapshot2 was taken. When I restored snapshot1 to a cluster Y (an empty cluster with no indices), it took approximately 1 hour. Cluster Y had no new data inserted (it is not accessed by anyone or any application). I then applied snapshot2 to cluster Y after closing all the indices, and it took 10 minutes to complete. Why did snapshot1 take 50 minutes longer than snapshot2?

The second snapshot is a full snapshot, as you can restore it without first restoring any other snapshot. This is what I meant by it not being incremental. You can only restore the full index in the state it was in at the time the snapshot was taken, not subsets of documents. I also believe it will overwrite any changes made to the target index. The copying of segments is however incremental in nature, as they are reused if they have not changed, so that would explain why restoring the second snapshot is faster: a lot of unchanged segments are already in place.
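A toy model of the restore side may help illustrate the timing difference you saw. The assumption here (and the segment sizes are purely illustrative, not measurements) is that restore only needs to copy segment data not already present and unchanged on the target cluster:

```python
# Toy model: restore work is driven by how much segment data must be copied.
def restore_cost_gb(snapshot_segments, already_on_target):
    """GB that must actually be copied to the target cluster."""
    return sum(size for seg, size in snapshot_segments.items()
               if seg not in already_on_target)

snapshot1 = {"seg_a": 40, "seg_b": 35, "seg_c": 25}              # ~100 GB
snapshot2 = {"seg_a": 40, "seg_b": 35, "seg_c": 25, "seg_d": 5}  # ~105 GB

on_target = set()                              # cluster Y starts empty
cost1 = restore_cost_gb(snapshot1, on_target)  # full 100 GB must be copied
on_target |= set(snapshot1)                    # segments now on Y, untouched

cost2 = restore_cost_gb(snapshot2, on_target)  # only the 5 GB delta is copied
print(cost1, cost2)
```

With nothing modified on cluster Y between the restores, almost all of snapshot2's segments are already in place, so only the delta has to move, which matches the 1 hour versus 10 minutes you observed.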

Given that you have a stable source cluster and are not modifying the restored indices in any way between the restores, you are seeing a large increase in restore speed. I am not sure this is guaranteed under all circumstances, though (someone will hopefully correct me if I am wrong). As snapshots are based on segments, and segment merging as far as I know is not coordinated across primaries and replicas, the segment composition may change a lot between the snapshots, e.g. if the original cluster suffered node failures or other issues between the two snapshots that caused replicas to be promoted to primaries (which could cause a new set of segments to be snapshotted).