I need to migrate multiple TB of data to a new cluster. Since this is production data, I’d like to avoid interruption as much as possible, which can be tricky given the volume of data. I have a simple infrastructure: one service is responsible for writing to Elasticsearch, and multiple services read from it. The writer targets multiple data streams depending on the data. Data is only appended to each data stream's current backing index; it never modifies previous data. Each data stream has a policy that rolls over its index every day.
I need to keep interruption of the services to a minimum. The maximum acceptable would be 2 days, but I’d like to stay well below that.
My plan to migrate is the following:
1 - stop the writing service so no new data is added to any index
2 - roll over every data stream so that the current backing index stops receiving writes
3 - take a manual snapshot
4 - run the restore procedure on the new cluster
5 - while the restore is running, restart the writing service and make it write to both clusters so new data keeps flowing (no modifications to previous indices, only new ones) → this is so that the read services can still access production data while the restore is happening
6 - wait for the restore to finish
7 - make the Elastic services query only the new cluster
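Steps 2–4 above could be sketched with the rollover and snapshot/restore APIs roughly like this (the data stream, repository, and bucket names are placeholders, and the repository must be reachable from both clusters):

```
# step 2: force a rollover so the current backing index stops receiving writes
POST /my-data-stream/_rollover

# step 3: register a shared repository, then take a manual snapshot
PUT /_snapshot/migration_repo
{
  "type": "s3",
  "settings": { "bucket": "my-migration-bucket" }
}

PUT /_snapshot/migration_repo/migration-1?wait_for_completion=false
{
  "indices": "my-data-stream",
  "include_global_state": false
}

# step 4: on the new cluster, with the same repository registered (ideally read-only there)
POST /_snapshot/migration_repo/migration-1/_restore
{
  "indices": "my-data-stream",
  "include_global_state": false
}
```

Snapshotting a data stream by name captures the stream definition plus its backing indices, so the restore on the new cluster recreates the whole stream.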
Step 5 is the one I'm not sure about: I don't know if I will be able to write to a data stream that is being restored on the new cluster, even though I never write to previous backing indices.
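One prerequisite for writing on the new cluster either way: a matching index template with a `data_stream` definition must exist there, or the writer's first document will fail to auto-create the stream. A minimal sketch, with the template name and pattern as assumptions:

```
PUT /_index_template/my-data-stream-template
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": {},
  "template": {
    "settings": { "number_of_shards": 1 }
  }
}
```

Note that backing index names are generated (`.ds-my-data-stream-...-00000N`), so a stream created by the writer before the restore finishes could in principle collide with restored backing index names; this is worth testing on a small throwaway stream before relying on it in the real migration.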
Does this plan seem OK? If not, what would be the best way to do this that does not require a license and avoids too long an interruption?
The new cluster will be deployed with ECK and run the latest version available. The current cluster is running on Elastic Cloud in version 8.17.
Both clusters are in different locations since the new one will be on our Kubernetes cluster.
Both clusters will have the same size and topology at first: 4 hot nodes with 60 GB of memory each. The goal is to shrink the new cluster later to better match actual resource usage.
The motivation for this migration is cost. Our company has experience managing ECK and wants to reduce costs by moving the cluster to Kubernetes (where we have more control over infrastructure costs).
> The new cluster will be deployed with ECK and run the latest version available. The current cluster is running on Elastic Cloud in version 8.17.
OK, that rules out migrating through a stretched cluster.
What is the longest retention period of your data?
If this is reasonably short and you have a message queue in your ingest pipeline, it might be an option to feed both clusters separately for a period of time until they hold the same data, and then switch over without any downtime at all.
Consider going round a snapshot/restore loop repeatedly first. Snapshots & restores are both incremental operations, so although the first one will take a while the later ones will be quicker as the two clusters get more and more synchronized, until you get to the point where you can run the process you suggest but without needing to start writing in step 5 until the restore is complete.
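That loop might look roughly like this (snapshot names are placeholders; each round only uploads segments not already in the repository, so successive rounds get faster):

```
# round N on the old cluster: a fresh snapshot, incremental over the previous ones
PUT /_snapshot/migration_repo/migration-2?wait_for_completion=true
{
  "indices": "my-data-stream",
  "include_global_state": false
}

# round N on the new cluster: restore the latest snapshot
POST /_snapshot/migration_repo/migration-2/_restore
{
  "indices": "my-data-stream",
  "include_global_state": false
}
```

Since a restore refuses to overwrite existing open indices, each round would need the previously restored copies removed (or the restore scoped to new backing indices) before re-restoring; the final, short round happens during the write pause.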
IMO it’d be simpler to use cross-cluster search though, leaving the existing data where it is until it ages out while new data accumulates in the new cluster. There’s probably some middle ground where you start with a cross-cluster search setup to do the main switchover and then migrate the older data across using snapshots, which you can do at a more leisurely pace since it’s not on the critical path.
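A minimal cross-cluster search setup from the new cluster back to the old one could be sketched like this (the remote alias and proxy address are assumptions; Elastic Cloud remotes are typically reached in proxy mode):

```
# on the new cluster: register the old cluster as a remote
PUT /_cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "old_cluster": {
          "mode": "proxy",
          "proxy_address": "old-cluster.example.com:9400"
        }
      }
    }
  }
}

# read services then query both clusters in a single request
GET /my-data-stream,old_cluster:my-data-stream/_search
{
  "query": { "match_all": {} }
}
```

With daily rollover and a finite retention period, the `old_cluster:` half of the query naturally returns less and less data until the remote can be dropped entirely.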