Migrate data to a new cluster with minimal interruption

Hi,

I need to migrate multiple TB of data to a new cluster. Since this is production data, I’d like to avoid interruption as much as possible, which can be tricky given the volume of data. I have a simple infrastructure: one service is responsible for writing to Elasticsearch, and multiple services read from it. The writer targets several data streams depending on the data. Data is only appended to the current write index of each data stream; previous data is never modified. Each data stream has a policy that rolls its index over every day.
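
For reference, the setup looks roughly like this (a simplified sketch with the Python client, assuming an ILM-style rollover policy; all names are placeholders):

```python
# Simplified sketch of the current setup: a data stream whose lifecycle
# policy rolls the backing index over daily. Names are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://current-cluster:9200", api_key="...")

# Daily rollover policy (assuming ILM; "daily-rollover" is a made-up name).
es.ilm.put_lifecycle(
    name="daily-rollover",
    policy={"phases": {"hot": {"actions": {"rollover": {"max_age": "1d"}}}}},
)

# Index template that turns matching names into data streams using that policy.
es.indices.put_index_template(
    name="logs-app-template",
    index_patterns=["logs-app-*"],
    data_stream={},
    template={"settings": {"index.lifecycle.name": "daily-rollover"}},
)
```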

I need to keep the interruption to the services minimal. The absolute maximum acceptable would be 2 days, but I’d like to stay well below that.

My plan to migrate is the following (a rough API sketch follows the list):

1 - stop the writing service so no new data is added to any index
2 - roll over every data stream so that the current backing index stops receiving writes
3 - take a manual snapshot
4 - run the restore procedure on the new cluster
5 - while the restore is running, restart the writing service and make it write to both clusters to add new data (no modifications to previous indexes, only new ones) → This is so that the read services can still access production data while the restore is happening.
6 - wait for the restore to finish
7 - make the elastic services only query the new cluster
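
In API terms, steps 2 to 4 would look roughly like this with the Python client (repository, snapshot, and endpoint names are placeholders):

```python
# Rough sketch of steps 2-4. Repository and snapshot names are placeholders,
# and the same snapshot repository must be registered on both clusters.
from elasticsearch import Elasticsearch

old_es = Elasticsearch("https://current-cluster:9200", api_key="...")
new_es = Elasticsearch("https://new-cluster:9200", api_key="...")

REPO = "migration-repo"
SNAPSHOT = "migration-full"

# Step 2: roll over every data stream so the current backing index
# is no longer the write index.
for ds in old_es.indices.get_data_stream(name="*")["data_streams"]:
    old_es.indices.rollover(alias=ds["name"])

# Step 3: take a manual snapshot of everything (runs in the background).
old_es.snapshot.create(
    repository=REPO,
    snapshot=SNAPSHOT,
    indices="*",
    include_global_state=False,
    wait_for_completion=False,
)

# Step 4: once the snapshot reports SUCCESS, restore it on the new cluster.
new_es.snapshot.restore(
    repository=REPO,
    snapshot=SNAPSHOT,
    indices="*",
    wait_for_completion=False,
)
```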

Step 5 is the one I'm not sure about. I don't know whether I will be able to write to a data stream that is still being restored on the new cluster, even though the writer never modifies previous indices.
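
For context, the dual-write in step 5 would be something like the sketch below (simplified, no retries or buffering); whether the bulk writes to the new cluster succeed while the restore still holds those data streams is exactly what I'm unsure about.

```python
# Hypothetical dual-write wrapper for step 5: every append goes to both
# clusters. Client setup and error handling are heavily simplified.
from elasticsearch import Elasticsearch, helpers

old_es = Elasticsearch("https://current-cluster:9200", api_key="...")
new_es = Elasticsearch("https://new-cluster:9200", api_key="...")

def append(data_stream: str, docs: list[dict]) -> None:
    """Append-only bulk write of the same documents to both clusters."""
    actions = [
        {"_op_type": "create", "_index": data_stream, "_source": doc}
        for doc in docs
    ]
    for client in (old_es, new_es):
        # If the new cluster rejects writes while the restore is in progress,
        # these documents would need to be buffered and retried.
        helpers.bulk(client, actions)
```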

Does this plan seem OK? If not, what would be the best way to do this that does not require a paid license and avoids too long an interruption?

Will both clusters be running the same version of Elasticsearch, or does the migration to new hardware include a version change?

Is the new cluster running in the same location and/or on the same type of hardware?

What is the reason that drives the move to a new cluster?

What is the size and topology of the current and target clusters?

The new cluster will be deployed with ECK and run the latest version available. The current cluster is running on Elastic Cloud in version 8.17.
The clusters are in different locations, since the new one will run on our Kubernetes cluster.
Both clusters will have the same size and topology at first: 4 hot nodes with 60 GB of memory each. The goal is to shrink the new cluster later so it better matches actual resource usage.

The motivation for this migration is cost. Our company has experience managing ECK and wants to reduce costs by moving the cluster to Kubernetes (where we have more control over infrastructure costs).

The new cluster will be deployed with ECK and run the latest version available. The current cluster is running on Elastic Cloud in version 8.17.

OK, that rules out migrating through a stretched cluster.

What is the longest retention period of your data?

If this is reasonably short and you have a message queue in your ingest pipeline, it might be an option to feed both clusters separately for a period of time until they hold the same data, and then switch over without any downtime at all.
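
Something along these lines, assuming Kafka (or a similar queue) sits in front of the indexer: the same consumer code runs once per cluster, each with its own consumer group, so the two feeds progress independently.

```python
# Sketch of the dual-feed idea. Topic, group and host names are placeholders,
# and kafka-python is just used as an example queue client.
import json
from kafka import KafkaConsumer
from elasticsearch import Elasticsearch

def feed(es_url: str, group_id: str) -> None:
    es = Elasticsearch(es_url, api_key="...")
    consumer = KafkaConsumer(
        "ingest-topic",
        bootstrap_servers=["kafka:9092"],
        group_id=group_id,                     # separate group per target cluster
        value_deserializer=lambda v: json.loads(v),
    )
    for msg in consumer:
        # Data streams are append-only, so op_type "create" is enough.
        es.index(
            index=msg.value["data_stream"],
            document=msg.value["doc"],
            op_type="create",
        )

# One process feeds the old cluster, another the new one, e.g.:
# feed("https://current-cluster:9200", "es-feeder-old")
# feed("https://new-cluster:9200", "es-feeder-new")
```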

How many TB are you migrating? Less than 10? 50? 100?

Since this is append-only, how long do you keep your data, and do you search old data, or are most searches on new data?

Also, what is the license level on the current cluster, and what will be the license level on the new cluster?

Consider going round a snapshot/restore loop repeatedly first. Snapshots and restores are both incremental operations, so although the first one will take a while, later ones will be quicker as the two clusters get more and more synchronized, until you reach the point where you can run the process you suggest, but without needing to start writing in step 5 until the restore is complete.
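
Very roughly, the repeated snapshot pass could look like the sketch below (placeholder names; note that any index already present on the target has to be closed or deleted before it can be restored over, which is the fiddly part).

```python
# Sketch of the repeated snapshot pass on the source cluster. Because
# snapshots are incremental at the repository level, each pass after the
# first only uploads segments written since the previous one.
import time
from elasticsearch import Elasticsearch

old_es = Elasticsearch("https://current-cluster:9200", api_key="...")
REPO = "migration-repo"   # repository reachable from both clusters

for i in range(10):       # repeat until the per-pass delta is small
    old_es.snapshot.create(
        repository=REPO,
        snapshot=f"sync-{i}",
        indices="*",
        include_global_state=False,
        wait_for_completion=True,   # block until this pass finishes
    )
    time.sleep(600)       # let some new data accumulate, then go again

# Between passes, restore the latest snapshot on the new cluster with
# new_es.snapshot.restore(...), after closing or deleting the indices
# that were restored in the previous pass.
```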

IMO it’d be simpler to use cross-cluster search though, leaving the existing data where it is until it ages out while new data accumulates in the new cluster. There’s probably some middle ground where you start with a cross-cluster search setup to do the main switchover, and then migrate the older data across using snapshots, which you can do at a more leisurely pace since it’s not on the critical path.
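
A minimal sketch of the cross-cluster search side, assuming the old Cloud cluster is registered as a remote on the new one (trust and security setup between the clusters is omitted, and the alias, addresses and index patterns are placeholders):

```python
# The read services query the new cluster only; the old cluster stays
# reachable through cross-cluster search until its data ages out.
from elasticsearch import Elasticsearch

new_es = Elasticsearch("https://new-cluster:9200", api_key="...")

# Register the old Elastic Cloud cluster as a remote named "old" (Cloud
# remotes are typically reached in proxy mode; check the address in the UI).
new_es.cluster.put_settings(
    persistent={
        "cluster.remote.old.mode": "proxy",
        "cluster.remote.old.proxy_address": "old-cluster.es.example.com:9400",
    }
)

# A single query then spans local (new) data and remote (old) data.
resp = new_es.search(
    index="logs-app-default,old:logs-app-default",
    query={"range": {"@timestamp": {"gte": "now-30d"}}},
)
```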