I’m looking for clarification regarding the functionality described in the documentation about reading indices from older Elasticsearch versions:
Currently, our process is as follows:
We store indices as snapshots in an S3 repository.
We plan to restore these snapshots into a newer Elasticsearch cluster version.
I would like to confirm:
Are there any version limitations or compatibility constraints when restoring snapshots created in significantly older Elasticsearch versions into newer clusters?
Is the restore process sufficient on its own, or are additional steps (such as reindexing or intermediate cluster upgrades) required for certain version gaps?
Are there recommended best practices for long-term snapshot archival intended for future restores into newer major versions?
Is Elastic planning to maintain this backward-reading capability for old indices in future releases?
If long-term backward compatibility is not guaranteed, what is the recommended strategy to ensure data stored today can still be reliably read (or restored) in, for example, 10 years?
Context: our datasets are at petabyte scale, so performing full reindexing during each major upgrade cycle would be operationally very difficult and costly.
My current assumption is that even if mappings are not fully compatible, the underlying data should still be readable. Please correct me if that assumption is incorrect.
Any clarification, recommendations, or real-world experience would be greatly appreciated.
I can answer a couple of the questions but will leave the rest for someone from Elastic.
This is described in the docs you linked to.
If you look on the subscriptions page I believe this is covered by the feature “Snapshot as simple archives”, which requires a commercial Enterprise level license.
If you do not have the required license you will only be able to restore indices created in the current or previous major version.
Restoring should be sufficient to get the indices into a read-only mode. If you need to make changes or add data I believe reindexing will be required.
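For reference, a plain restore followed by a reindex might look like the sketch below (repository, snapshot and index names are all placeholders, adjust to your setup); the restored index comes back at whatever version it was created with:

```shell
# Restore a single index from the S3-backed repository (names are hypothetical)
curl -X POST "localhost:9200/_snapshot/my_s3_repo/archive-snap-1/_restore" \
  -H 'Content-Type: application/json' -d '
{
  "indices": "old-logs",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}'

# If the data then needs to be writable on the current version,
# reindex it into a newly created index
curl -X POST "localhost:9200/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{ "source": { "index": "restored-old-logs" }, "dest": { "index": "old-logs-new" } }'
```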
Note that this feature supports snapshots taken in versions of Elasticsearch going back to 5.0.0, released October 26 2016, which isn’t far off 10 years ago.
... for indices created in the previous major version
Note it says "previous major version", singular. 9.x can read/write 8.x-created indices, 8.x can read/write 7.x-created indices, and so on. There are often threads on here about, say, upgrading to 9.x with a cluster that still has some 7.x-created indices; often it will involve re-indexing data.
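The rule is mechanical enough to script. A rough sketch, assuming you have already pulled each index's creation major version (e.g. from its `version.created` setting) — the function name is mine, not an Elasticsearch tool:

```shell
# Prints "ok" if an index created in major version $1 is readable by a
# cluster running major version $2 (the "current or previous major" rule),
# otherwise prints "reindex".
check_readable() {
  created=$1
  cluster=$2
  if [ "$created" -eq "$cluster" ] || [ "$created" -eq $((cluster - 1)) ]; then
    echo "ok"
  else
    echo "reindex"
  fi
}

check_readable 8 9   # ok
check_readable 7 9   # reindex
```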
As others have discussed, what license do you have, if any? This is also important.
Also, can you clarify what your status is right now? i.e. are we talking about a "10 year plan" starting now? Started a few years ago? Started 10+ years ago already? If the latter, I'm a bit surprised you got this far without thinking about this already.
Do you know all the versions for the indices stored in the S3 snapshot repo? How "newer" is newer in the scenario you are considering?
This suggests you have never done a major version upgrade? Is that the case?
From the documentation, I didn’t understand that this feature requires an Enterprise license. Unfortunately, we are currently using only the free/open Basic license.
That’s fine for us — we only need read access to the data.
Our oldest indices/snapshots were created in, or reindexed to, version 7.x. Our long-term requirement is to be able to access archived data even 10+ years in the future.
We are currently running ES 8.x and plan to upgrade to 9.x this year. Snapshots are at 7.x or 8.x.
We have performed major upgrades before, but since the last one our data volume has grown significantly and is now at PB scale.
Just to make sure I understand correctly:
Is long-term access to old indices across many major versions only supported with an Enterprise license (for example using snapshot archive functionality), while with a Basic license compatibility is only guaranteed for indices created in the current or previous major version?
And for long-term archival (10+ years), is it recommended to treat snapshots mainly as backups, and rely on periodic reindexing or exporting data into a version-independent format (such as JSON) to ensure future readability?
Yes, Snapshots as Simple Archives is an Enterprise feature; it requires an Enterprise license.
With a Basic license you can only use indices created in the current or the previous major version.
For example, a cluster on version 9.x can read indices created in 9.x and 8.x; an index created in 7.x or lower cannot be read.
Before you upgrade to version 9.x you will need to restore the snapshots of any indices created in 7.x and reindex them in version 8.x.
This is up to the user, the only supported way out of the box requires the enterprise license.
With the basic license you would need to make sure that none of your indices will have an unsupported version difference, so you would need to keep restoring and reindexing them before each upgrade.
This could take a lot of time and be very expensive to the point that it is easier and cheaper to get a cluster with an Enterprise license.
This is very much the point. I know this is a community forum and we do our best to offer the same level of support for all users regardless of whether they are Elastic customers or not, so I don’t want this to sound like a sales pitch (and TBC I’m very much not on commission): at the kind of scale you’re talking about I would encourage you to consider getting access to licensed features like searchable snapshots and archive snapshot support. Avoiding the need to reindex a PiB of data every few years will surely make it worthwhile.
With a Basic license alone, I think it's a matter of personal taste as to what approach to take. You want (N-2).x-created indices/snapshots readable on an N.x cluster? Then you will have to re-index at some point, so you are only choosing when and where to do the re-index. If all the 7.x indices/snapshots are now read-only, then IMO you may as well start now: restore and re-index on an 8.x cluster, snapshot again. You could even then do the same and convert them into 9.x indices. If all the 7.x stuff is on S3 you could spin up an ephemeral "data migration" cluster or clusters to help with this, leaving the prod system alone. "Bookkeeping" could easily get a bit messy, might already be a bit messy!!
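The restore → re-index → snapshot-again cycle on such an ephemeral cluster might look roughly like this (repository, snapshot and index names are invented for illustration):

```shell
# On an ephemeral 8.x "migration" cluster, registered against the same S3 repo:

# 1) restore the 7.x-created index from the snapshot
curl -X POST "localhost:9200/_snapshot/s3_repo/old_snap/_restore" \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "logs-2018" }'

# 2) reindex it into a fresh index, now created in 8.x
curl -X POST "localhost:9200/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{ "source": { "index": "logs-2018" }, "dest": { "index": "logs-2018-v8" } }'

# 3) snapshot the new index back to S3, then tear the cluster down
curl -X PUT "localhost:9200/_snapshot/s3_repo/old_snap_v8" \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "logs-2018-v8" }'
```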
Cost wise, you would need to "do the maths" to work out if buying a license would be more cost effective, though you would be estimating a lot to even approximate TCO over the next X years. Note a license also brings vendor support and many, many other features. e.g. synthetic source is a licensed feature, so that might also help on storage costs, depending on your data of course. Given what you have written, I would certainly suggest talking to someone at Elastic (noting I don't and never have worked for Elastic).
Depending on the use case there is another feature that requires a license that may be of interest - searchable snapshots. This allows data in the cluster to be stored on S3 and still be available from within the cluster. Maybe this would allow you to technically keep the data in the cluster and greatly simplify management of the data on S3? As it is a licensed feature I do not have much experience with it, but it may be something worth exploring as it could potentially save a lot of time and effort.