Source cluster: Elasticsearch 7.x (live in production)
Target cluster: Elasticsearch 8.x
Data volume: ~1 TB across ~100+ indices
Write throughput: ~53 requests/sec, ~1,100 docs/sec (bulk syncs via op type “index”)
Use Case / Objective:
We intend to migrate from the ES7 cluster to a new ES8 cluster by using the following approach:
Enable dual-write from the application to both ES7 (primary) and ES8 (secondary) such that all new documents and updates go into ES8 in real time (a minimal sketch of this step follows this list).
After the dual-write has been active for a sufficient time, take a snapshot of the ES7 cluster and restore it into ES8 under suffixed indices (e.g., {original_index_name}_dump); a restore sketch also follows this list.
Run a merge process (via script): for each document in the *_dump index, compare _id and updated_at against the live ES8 index. If the snapshot document’s updated_at is greater than the live one (or the document does not exist), index it; otherwise skip.
After merge and validation, delete the *_dump indices and retire the old cluster.
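A minimal sketch of the dual-write step, assuming the official elasticsearch-py client and hypothetical endpoints (in practice you may need a separate client library version per cluster, e.g. a 7.x client for ES7 and an 8.x client for ES8):

```python
from elasticsearch import Elasticsearch

# Hypothetical endpoints -- substitute your own.
es7 = Elasticsearch("http://es7-cluster:9200")
es8 = Elasticsearch("http://es8-cluster:9200")

def dual_write(index: str, doc_id: str, doc: dict) -> None:
    """Index the full document into ES7 (primary), then ES8 (secondary).

    The ES7 write is authoritative and must succeed; a failed ES8 write
    is only logged, since the later snapshot/merge (or backfill) step
    will repair any gap."""
    es7.index(index=index, id=doc_id, document=doc)
    try:
        es8.index(index=index, id=doc_id, document=doc)
    except Exception as exc:  # best effort: never block production writes
        print(f"ES8 dual-write failed for {index}/{doc_id}: {exc}")
```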
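And for the restore step, the snapshot restore API’s rename support can produce the *_dump suffix directly; a sketch where the repository and snapshot names are hypothetical:

```python
from elasticsearch import Elasticsearch

es8 = Elasticsearch("http://es8-cluster:9200")  # hypothetical endpoint

# Restore the ES7 snapshot under a *_dump suffix so the live indices
# being fed by the dual-write path are never touched. Narrow the index
# pattern to your data indices to avoid pulling in system indices.
es8.snapshot.restore(
    repository="migration_repo",   # hypothetical repository name
    snapshot="es7-pre-cutover",    # hypothetical snapshot name
    indices="my_index_*",          # hypothetical data-index pattern
    rename_pattern="(.+)",
    rename_replacement="$1_dump",
    include_global_state=False,    # avoid importing ES7 cluster state
    wait_for_completion=False,
)
```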
Specific Questions for Elastic Support:
Is this strategy supported or recommended from Elastic’s perspective (for ES7 → ES8 migration)?
Are there any known limitations, caveats or unsupported behaviours in ES8 (or in cross-version restore) when restoring large volumes with this “merge by timestamp” logic?
Are there performance or reliability risks (e.g., rewrite loops, version conflicts, segment-merge issues) specific to such a dual-write + snapshot/restore + conditional-update flow?
Is there any official documentation, case study or best practice published by Elastic (or community) that follows this pattern exactly (dual‑write first, then snapshot‑restore, then conditional merge)?
Are there better alternative approaches recommended by Elastic for this scenario (1 TB data, high throughput, many indices) that would reduce risk or technical debt?
Additional Context / Constraints:
We do not use partial updates: every application update issues a full document index operation.
Document _id space is deterministic and consistent between ES7 and ES8.
We already have an updated_at field on every doc.
We prefer not to block production writes during the migration window (multiple clients write to the same cluster concurrently).
Please advise on how we should prepare our cluster (both source and target) and what settings or monitoring we should apply during the migration to ensure safe completion.
Please note that this is a community forum and not a support forum. There is no guarantee that you will hear from Elastic support at all.
Which exact version are you using?
Why not perform an in-place rolling upgrade instead as that likely would be a lot easier (even if it were to require multiple upgrade steps)?
This sounds like a completely custom approach that I doubt anyone else has tested or documented. It may work, but it is on you to ensure that you test and verify the validity of the approach properly.
If you are looking for a supported and recommended approach I believe you should perform an in-place rolling upgrade. Unless you have indices created in Elasticsearch 6.x this should be relatively simple and not require reindexing if you just move to Elasticsearch 8.19.
I am not seeing where queries are going during the various steps?
How would that happen? If it happened, does that not indicate something went wrong with the parallel write?
Is there a specific reason you are eschewing the in-place rolling upgrade most people (I think?) use?
Though it should not really matter, can you add your cluster topology for the 7.x and 8.x - how many nodes of each node type? Are you using the upgrade as an opportunity to (eg) change some characteristics there?
Yes, the current ES version is 7.17, which was upgraded via rolling upgrade from ES 6.8.
We are planning to upgrade it to 8.17.
Thanks, Christian, fair call. I just realized this is a community forum, not a direct line to Elastic Support. But hey, if any seasoned Elasticsearch travelers out there have walked the path of dual-writes, snapshot restores, and timestamp-based merges… I’d love to hear what worked (or blew up). (:
For queries: during the entire migration, queries continue to go to ES7 only. We treat ES8 as a passive dual-write target and snapshot replay sink.
Merge process: my bad, yes, that case won’t happen. The script flow would be (a sketch follows this list):
Iterate through each document in {original_index_name}_dump
If _id does not exist in live index, insert it
If _id exists, compare modifiedTime
If snapshot’s modifiedTime < ES8’s → skip (probably this check is not even required: if the doc exists in the live index, it definitely came from a more recent write).
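A sketch of that merge flow, assuming `modifiedTime` is a sortable epoch-millis value and a hypothetical endpoint; it streams the dump index with `helpers.scan` and batches the existence/recency checks with `mget`:

```python
from elasticsearch import Elasticsearch, helpers

es8 = Elasticsearch("http://es8-cluster:9200")  # hypothetical endpoint

def merge_dump(index: str, batch_size: int = 500) -> None:
    """Merge {index}_dump into the live index, never overwriting a
    document the dual-write path has already written more recently."""
    dump = f"{index}_dump"
    batch = []
    for hit in helpers.scan(es8, index=dump):
        batch.append(hit)
        if len(batch) >= batch_size:
            _flush(es8, index, batch)
            batch = []
    if batch:
        _flush(es8, index, batch)

def _flush(es, index, hits):
    # One mget per batch instead of one GET per document.
    live = es.mget(index=index, ids=[h["_id"] for h in hits])
    live_by_id = {d["_id"]: d for d in live["docs"]}
    actions = []
    for hit in hits:
        cur = live_by_id.get(hit["_id"])
        if cur is not None and cur.get("found") and \
                cur["_source"]["modifiedTime"] >= hit["_source"]["modifiedTime"]:
            continue  # live copy is at least as new: dual-write already won
        actions.append({"_op_type": "index", "_index": index,
                        "_id": hit["_id"], "_source": hit["_source"]})
    if actions:
        helpers.bulk(es, actions)
```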
In-place upgrade: we considered the in-place rolling upgrade path but intentionally chose a parallel-cluster strategy for these reasons:
It decouples the migration from production traffic for risk isolation; the new ES8 cluster will be validated and monitored before full cutover.
Legacy data includes indices originally created in ES 6.x, so even on 7.17 many of them would still require reindexing post-upgrade, defeating the simplicity of a rolling upgrade.
The new-cluster approach gives us full rollback control; we don’t compromise production writes during the upgrade window.
For deletions: our data model uses soft deletes via a boolean field (deleted: true/false), so every deletion is actually a full document reindex with deleted = true.
Given that our ES7 cluster functions as a multi-tenant backend supporting live client workloads, we cannot afford to pause writes during the migration window. This constraint necessitates a migration strategy that supports continuous ingestion without disrupting downstream systems.
Note that the reindexing would need to take place before the in-place upgrade. This could be done on a per index basis but would require some period of writes/updates/deletes being halted or queued for each index. As you seem to have a lot of reasonably small indices I would expect the time period the index would need to be read-only to be reasonably short though.
As you are indexing complete documents I suspect it would be reasonably easy to implement such a queueing mechanism, and this may help also outside of the migration in the future. The only downside with this approach would be that it would result in somewhat stale data for each client during a time period. Like the other approach it is however easy to roll back, but it also has the benefit of being a lot easier to ensure consistency and completeness.
If you go with your approach I do not see the point of restoring a snapshot into the new cluster. As you will need to do this using a custom script anyway, why not just connect to both clusters and read from ES7 and write to ES8?
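For what it’s worth, that direct copy is only a few lines with the Python helpers (a sketch with hypothetical endpoints; in practice you may need a different client library version per cluster):

```python
from elasticsearch import Elasticsearch, helpers

# Hypothetical endpoints -- substitute your own.
es7 = Elasticsearch("http://es7-cluster:9200")
es8 = Elasticsearch("http://es8-cluster:9200")

def copy_index(index: str) -> None:
    """Stream every document out of ES7 and bulk-index it into ES8,
    skipping the snapshot/restore round trip entirely."""
    actions = (
        {"_op_type": "index", "_index": index,
         "_id": hit["_id"], "_source": hit["_source"]}
        for hit in helpers.scan(es7, index=index)
    )
    helpers.bulk(es8, actions)
```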
I am a bit unsure of myself now: an index created in 6.x, used (but not upgraded) and then snapshotted by 7.x, can that snapshot of that index be directly restored into 8.x?
Valid point. For this to proceed I’ll need to reindex the 6.x-origin indices in the 7.x cluster before snapshotting.
Given that, I’m now leaning toward dropping the snapshot-restore path altogether and instead using our source-of-truth data (MongoDB) to backfill any documents not captured via dual-write. That gives us a cleaner and more consistent way to handle the delta.
Elasticsearch can only read indices created on a direct previous major version, so on version 8 you can only read indices created on version 7, not on version 6.
Newer versions, like 8.19, can read snapshots containing indices created on version 6 and 5 if the cluster has an Enterprise license.
This was introduced in 8.3, as described in this post.
Indeed. Your confidence had me questioning myself!
Still not really buying why not do rolling in-place version upgrade(s). Did the 6.x to 7.x upgrade go wrong in some way that you’ve thought “never again!” ?
That said, you know your data and environment way better than any of us. Risk-averse/cautious is good. Thinking about rollbacks is always good. Trying to re-baseline onto a clean 8.x installation is also a pretty good idea, given 9.x has already been out for a while. It’s just … maybe a bit harder.
I was thinking about this and have a slightly different approach to the dual cluster approach that I think is a bit easier.
Introduce a message queue, e.g. Kafka, that handles all writes, updates and deletions. This adds flexibility in that you can have multiple consumers indexing into different clusters independently, and should only add a little latency during normal operations.
Have a consumer that indexes into the ES7 cluster continuously. Create a separate consumer that will index into the ES8/ES9 cluster, but do not activate it yet. This consumer needs to be able to handle conflicts based on timestamps/versions as you described earlier (see the sketch after this list).
Set up a new and empty ES8/ES9 cluster.
When you want to start data migration to the new cluster, take a note of the current offset of the queue so you know what has already been processed.
Start creating indices in the new cluster using reindex from remote from the ES7 cluster. This will reindex based on the current state of each index.
Once all indices have been created, set the offset to the recorded value and let the ES8/ES9 consumer start processing the backlog of operations from this point. Once it has caught up all writes will be going into both clusters in near real time.
Check that everything is looking good and switch over queries at any point.
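To make the reindex-from-remote step concrete, it is one API call per index; a sketch assuming the ES7 host is listed in `reindex.remote.whitelist` on the new cluster (all names hypothetical):

```python
from elasticsearch import Elasticsearch

es8 = Elasticsearch("http://es8-cluster:9200")  # hypothetical endpoint

# Requires reindex.remote.whitelist: "es7-cluster:9200" in the new
# cluster's elasticsearch.yml, and the target index (settings/mappings)
# created up front.
es8.reindex(
    source={
        "remote": {"host": "http://es7-cluster:9200"},
        "index": "my_index",        # hypothetical index name
    },
    dest={"index": "my_index"},
    wait_for_completion=False,      # returns a task id you can poll
)
```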
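And for the conflict handling in the new-cluster consumer, external versioning with the document timestamp lets Elasticsearch itself reject stale replays, which makes replaying the backlog idempotent. A sketch with kafka-python; the topic name, message shape and field names are assumptions:

```python
import json

from kafka import KafkaConsumer            # pip install kafka-python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es8 = Elasticsearch("http://es8-cluster:9200")  # hypothetical endpoint

consumer = KafkaConsumer(
    "es-writes",                    # hypothetical topic name
    bootstrap_servers="kafka:9092",
    group_id="es8-indexer",         # own group => independent offset
    enable_auto_commit=False,
    value_deserializer=json.loads,
)

for msg in consumer:
    op = msg.value  # assumed shape: {"index": ..., "id": ..., "doc": {...}}
    try:
        # version_type=external: Elasticsearch rejects the write unless the
        # supplied version is strictly greater than the stored one, so
        # replaying the backlog can never clobber a newer document.
        es8.index(
            index=op["index"],
            id=op["id"],
            document=op["doc"],
            version=op["doc"]["updated_at"],  # epoch millis as version
            version_type="external",
        )
    except ConflictError:
        pass  # a newer copy is already indexed; safe to skip
    consumer.commit()
```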