Reason why version downgrades are not supported


Elastic documentation states that version downgrades are not supported: "In-place downgrades to earlier versions are not supported. To downgrade to an earlier version, restore a snapshot taken prior to the version upgrade."

Interestingly, it is not even supported for patch releases, e.g. from 7.1.1 to 7.1.0 (as mentioned by the Elastic team in the thread "Unwanted Upgrade ElasticSearch version": "To be clear, all downgrades are forbidden.").

Can you please help us understand why downgrades are not supported, while upgrades work fine? Are there technical limitations that prevent downgrades even between patch releases?
A bit more insight into the technical reasons for this limitation would greatly help.


The format of the data on disk often changes between versions, rendering it unreadable by earlier versions. This is even true between patch releases where a bugfix necessitates a new file format. Upgrades are supported by preserving the logic for reading old data formats and adding new formats alongside them, but maintaining support for writing old data formats would be onerous, often impossible, so downgrades cannot be supported.
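To illustrate the asymmetry described above, here is a hypothetical sketch (not Elasticsearch's actual code): readers for every historical on-disk format are preserved, while only the newest writer exists, so an upgraded node can read old data but there is no path back.

```python
# Hypothetical sketch of upgrade-only format compatibility.
# Readers for old on-disk formats are kept around; writing always
# uses the newest format, so data can only move forward in time.

CURRENT_VERSION = 3

# One reader per historical format version (illustrative payloads).
READERS = {
    1: lambda raw: {"doc": raw},                                    # v1: bare payload
    2: lambda raw: {"doc": raw, "checksum": None},                  # v2: added checksums
    3: lambda raw: {"doc": raw, "checksum": 0, "codec": "speed"},   # v3: added codec field
}

def read_record(version: int, raw: str) -> dict:
    """An upgraded node can read any format it has ever supported."""
    try:
        return READERS[version](raw)
    except KeyError:
        # Data written by a *newer* node: this is the downgrade scenario.
        raise IOError(f"format v{version} is newer than this node (v{CURRENT_VERSION})")

def write_record(raw: str) -> tuple:
    """Writing always emits the newest format; old writers are not maintained."""
    return CURRENT_VERSION, READERS[CURRENT_VERSION](raw)
```

An older node handed v3 data is in the same position as `read_record(4, ...)` here: it has no reader for the format, so the data is simply unreadable.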


Suppose one were to procure an 'intermediate' release of Elasticsearch which sat between two tags and backported binary format changes but nothing else. Would this be a viable strategy for enabling limited rollbacks? The idea would be to separate binary format changes (which are seemingly what disable rollbacks) from behavioural changes (whose regressions, or unexpectedly unhandled breaks, tend to produce hard-to-resolve outages).
We are presently going through an extremely expensive migration from ES2 to ES6 in which all of our code now understands how to index and query into both search engines and can flip back and forth (this was useful for when we e.g. discovered differences in NaN behaviour). We're looking towards ES7 and ES8 and so forth and are trying to figure out a simpler migration path.

Most changes to file formats are intimately related to the corresponding behavioural changes so I don't see how you'd get one without the other. Thinking of things like adding a completely new data type, I don't think you can reasonably even define what it would mean to downgrade after that sort of change.

Also you'd be running a very nonstandard build so the heavy burden of testing it would fall on you and you alone. Seems pretty risky IMO.

ES2 was long before my time, and is very old indeed (2½ years past the end of its supported life) so I can't comment on the specific problems you're currently facing there. Also it's a long jump from ES2 to ES6; it's much better to upgrade only one major version at a time. This is the documented recommendation in 7.x, but I think it holds for older versions too.

We put a lot of care into backwards compatibility at the REST API level with the goal that you can run a single version of your client application against different versions of Elasticsearch. The intention here is that you can focus on one upgrade at a time, either Elasticsearch or the client, and don't need to think too hard about cross-version compatibility even across a major version upgrade. If you find a situation where you can't do that, it's probably a bug, please report it.


You had some more specific questions in the other thread you opened too that make sense to answer here:

I'd assume that the translog files could potentially be a problem, but I'm not sure how frequently the format of those gets changed (I would guess infrequently compared to Lucene). Likewise, cluster state seems like it could be problematic.

FWIW technically yes, the translog format is pretty stable since it's so simple. The point is it's not guaranteed to be stable, so if you have a process built around that assumption then it will break, normally at the worst possible time, and then you're hosed. Other things like the cluster state are definitely much less stable.
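A toy sketch of why "pretty stable" is not the same as "guaranteed stable" (this is not the real translog format; the header layout here is purely illustrative): a reader built against one format version works fine until a release bumps the version stamp, and then it fails outright.

```python
# Illustrative only: a translog-like record with a format-version header.
# A process that assumes the format never changes works until it doesn't.
import struct

TRANSLOG_FORMAT = 1

def write_op(op: bytes, fmt: int = TRANSLOG_FORMAT) -> bytes:
    # header: format version (u32) + payload length (u32), then the payload
    return struct.pack(">II", fmt, len(op)) + op

def read_op(buf: bytes) -> bytes:
    fmt, length = struct.unpack_from(">II", buf)
    if fmt != TRANSLOG_FORMAT:
        # an external tool built on the stability assumption breaks here,
        # typically at the worst possible time
        raise IOError(f"unknown translog format v{fmt}")
    return buf[8:8 + length]
```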

It used to be the case that downgrades would in fact often work (or at least would be silent about whatever bits didn't work) and people really would build processes on that basis and then get upset when they stopped working despite all the docs saying not to do this. Since #41731 we actively block all downgrades, even where it might have been ok, to prevent others from falling into this trap.
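A minimal sketch of the kind of guard described here (the real check lives in Elasticsearch's node startup code; this bare version-tuple comparison is only illustrative): on startup, compare the node's version with the version recorded in the data path, and refuse to start if the data was written by a newer node.

```python
# Illustrative only: hard-block any downgrade, even patch-level, rather
# than let a sometimes-working downgrade become a process people rely on.

def check_can_start(node_version: tuple, data_version: tuple) -> None:
    """Raise if starting would be a downgrade; tuples compare lexicographically."""
    if data_version > node_version:
        raise RuntimeError(
            f"cannot downgrade a node from version {data_version} "
            f"to version {node_version}"
        )

check_can_start(node_version=(7, 1, 1), data_version=(7, 1, 0))  # upgrade: allowed
```

Note that the block is unconditional: a 7.1.1-to-7.1.0 move is refused even if the on-disk formats happened not to change between those releases.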


That all makes sense, I think. This is certainly not a path we'd like to go down, especially not lightly, we're weighing up our options. For some extra context on where we've had problems:

  • In terms of cross-version compatibility: we're aware of this, and from 2 to 6 we struggled particularly with the removal of the string type and of parent-child relationships. We could have gone through 5.6, but I think we'd still be reluctant to do an in-place upgrade (it would have gone awfully) because
  • it's sometimes quite hard for us to disentangle an ES release from a release of our client, because in a couple of cases our developers have unwittingly exposed subtle parts of the ES API into our own API (stuff like whether NaNs index without failing, or the ES field count limit) which has sometimes meant that we've learned about an ES break when our own API has broken in ways we never meant to expose (Hyrum's Law in action). When you hit this, not having an exit strategy is rather painful. It's generally ok if it's only our code that needs to be fixed, but if we ship a change where we need to notify our customers and have them change their configurations, that obviously leaves a sour taste.
  • This is obviously epic tech debt, which we've now paid down, but we found a couple of places where engineers on one team had accessed the index of another team. Inevitably some of them did not add support for the new ES version, we flipped onto the new version, and everything broke for them.
  • We've occasionally (I can think of twice) had issues with performance regressions. As one example, we had one ES upgrade where after a few days performance completely died, but predictably only on our largest and most important cluster. In the end, we tracked it down, but it took a while, and in the meantime we'd nuked the cluster and started over. It's a scary prospect to be considering fully reindexing dozens of terabytes in order to restore service (thankfully this was during the ES2-6 migration and so we had a hot ES2 cluster to fall back on). I think in the other case we just ate degraded performance for a month on one of our clusters, but I don't remember the details.

Because of this, in general our strategy's been 'real life is messy, bring a shovel'. From ES2 to ES6 we ran two clusters in parallel, which was expensive from a coordination and dev perspective but was the only viable approach at the time given our environment (maybe we could have gone through ES5, but that'd be a different kettle of fish which isn't obviously easier given the issues we saw in practice). It meant we could do things like have internal users run all their queries through ES6. We're looking at reducing the coordination burden while maintaining our ability to roll back (which we found to be a development accelerant).

In terms of some changes literally just not being forwards compatible (e.g. adding a new data type) our process is that we don't use new features until we know we won't want to roll back (we gate the ES APIs our devs can use). Ideally if you don't use a feature in the new format, you have a viable rollback path. But, clearly this might not work for every upgrade, and certainly not every way of writing an upgrade. We also use a fairly sparse set of ES features (e.g. no scripts, reindexes, sql, etc).
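As a sketch of what such gating might look like (the API names below are illustrative, not an actual allowlist): a default-deny check applied to requests before they reach the cluster, so features introduced in the new version stay unused until a rollback is no longer wanted.

```python
# Hypothetical sketch of gating the ES APIs developers can use.
# Default-deny: anything not explicitly allowed (scripts, reindex, sql,
# and any feature new in the target version) is rejected.

ALLOWED_APIS = {"_search", "_bulk", "_doc", "_mget"}  # illustrative set

def is_request_allowed(path: str) -> bool:
    # find the first underscore-prefixed path segment, i.e. the API name
    api = next((p for p in path.split("/") if p.startswith("_")), None)
    return api in ALLOWED_APIS
```

The design choice is that unknown endpoints fail closed, which is what keeps new-format-only features out of the indices until the rollback window has passed.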

In terms of the safety things, yeah, those make a lot of sense - we're mostly investigating what the work would look like right now (probably looking to see what it would have taken to get from 6.0 to 6.1, or from 6.8 to 7.0 (the changes for both of those are beefier than 6.7 to 6.8)).

Hopefully that's useful for you and should explain a bit more about why we're considering a path like this (and to be clear, it's very much a 'science project, can we make our lives easier' kind of thing).

But, to make sure I understand... If we were to go down this path, am I right in thinking:

  • Generally what we would care about are PRs that add new indexing features, of which there are typically many around a new release. But really what we'd care about are changes to classes that interact with Lucene, or the metadata files - and specifically not new features, but pre-existing features that are changed to behave differently (and in an incompatible way). Or Lucene upgrades.
  • Upgrades that change the cluster state storage would also be a part of this. Those seem much harder to work around, given you'd effectively have to implement a backwards migration between e.g. ES6 and ES7, but at least you can update master and data nodes independently.
  • If the translog format is changed, we would also care about that, but it should change less frequently and so can be more of a sanity check.
  • Are there things I'm missing with the above?

My expectation is that this is probably pretty viable for later minor versions (e.g. 6.6 vs 6.5 looks fairly tractable) but it looks much trickier for major versions, and that's where the wins are.

Thank you for your replies!

I don't think it sounds viable at all; you're basically proposing maintaining your own fork of Elasticsearch. It would take an extraordinary amount of effort to do as you propose, given the pace at which things move forward even in minor versions. There's no sense in which you can remain forwards-compatible simply by avoiding certain features, and it's extraordinarily likely you'll end up with your data in a format that simply cannot be turned back into something that the official Elasticsearch can work with.

IMO you'd be enormously better off spending your time on improving your processes, focussing particularly on testing around upgrades, allowing you to follow the recommended upgrade path with confidence. "Everything broke when we flipped onto a new version" is entirely avoidable, and picking up performance regressions early is possible too (we missed #56708 because it only bites in scenarios we don't formally benchmark ourselves).


According to the Elastic docs, the suggestion is to back up the data using a snapshot before the upgrade, reinstall an empty cluster of the pre-upgrade version, and restore its contents from the snapshot.

Actually, we deploy the ELK stack in a Kubernetes environment using Helm charts. Here, ELK is part of a bigger umbrella Helm chart that deploys other applications too. I can back up the data before the upgrade. But in order to achieve a downgrade (rollback), we cannot simply delete the Helm release, as that would delete the other applications too.

Is there a way we could still achieve a downgrade without a reinstall?
For example, could we delete some data before triggering the downgrade?

Let's say we have a Helm release installed with the ELK 7.0 chart that can be upgraded to the 7.8 chart. We also want to be able to do a helm rollback between the two revisions of the release. Any thoughts on how we could do this in Helm?
