Is it possible to downgrade a cluster without any data loss?

I have upgraded an Elasticsearch cluster to version 7.3.2 and now I want to downgrade it to version 6.8.1. Is it possible to do this without any data loss? Live data is being pushed to the cluster.

No. The only way to downgrade is to start up a new cluster and restore a snapshot taken before the upgrade. Elasticsearch does not support downgrading.
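A minimal sketch of that restore path, assuming a pre-upgrade snapshot named snapshot_1 exists in an already-registered repository named my_backup (both names are hypothetical):

POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "*",
  "include_global_state": true
}

You would run this against a fresh 6.8.1 cluster. Note that any data indexed after the snapshot was taken is not recoverable this way.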

There is no supported way. Here are some of my observations:

  • If you have only upgraded the master nodes and not the data nodes, there's a way to roll back. I have tried this and it worked. See this Stack Overflow thread.

  • If you have upgraded everything, I don't think there is any way if it was a major version upgrade (your case). For a minor version upgrade, you could do a rolling downgrade if you have multiple replicas: delete the data from one node at a time and restart that node on the older version. Make sure you roll back the master nodes first so that you can form a cluster (only if you have dedicated master nodes). I haven't tried any of this, but it might work since the indices should be backwards compatible (don't try this in prod :) ).

The linked Stack Overflow post does not describe a way to roll back without data loss, although the loss may be silent in the versions mentioned there. I think it will not work at all when downgrading from 7.3.2, since protection was recently added to prevent this kind of unsafe activity.

No, this won't work either. Indices are backwards-compatible but not forwards-compatible, and forwards-compatibility is what we need here. If you try to join older nodes to an upgraded cluster, they will not be assigned any shard copies.

Is there any way to downgrade using a backup of the data directory instead of a snapshot?

No, there is not. You cannot take a backup of your data directory - at least, you cannot safely restore from such a backup, even into the same version. From the docs:

WARNING: You cannot back up an Elasticsearch cluster by simply taking a copy of the data directories of all of its nodes. [....] If you try to restore a cluster from such a backup, it may fail and report corruption and/or missing files. Alternatively, it may appear to have succeeded though it silently lost some of its data. The only reliable way to back up a cluster is by using the snapshot and restore functionality.
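For reference, the supported snapshot-based backup looks something like this. The repository name and filesystem location here are made up, and a shared filesystem location must be listed in path.repo on every node:

PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup"
  }
}

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

A snapshot taken this way before an upgrade is the only thing you could restore from if you later needed to rebuild on the older version.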

Why do you want to downgrade?


The linked Stack Overflow post does not describe a way to roll back without data loss, although the loss may be silent in the versions mentioned there. I think it will not work at all when downgrading from 7.3.2, since protection was recently added to prevent this kind of unsafe activity.

Yeah, I have no experience with 7.3.2. But I did successfully downgrade from 6.1 to 5.6 a couple of weeks back. This wasn't documented, but I had to give it a try (see the reason below; it might give the Elasticsearch team some useful feedback):

  • Take one master node down
  • Delete the state files under /nodes/0 (data dir of dedicated master node)
  • Start the master with 5.6
  • Wait for it to join the cluster (it actually did)
  • Repeat the above steps for the rest of the master nodes

These were dedicated master nodes, BTW. Reason below (fun part):

We had an existing configuration that increased max_compilations_per_min in our cluster settings. While upgrading the cluster, after the dedicated master upgrades were complete, it started complaining that this configuration was deprecated and had to be unset from the cluster settings in order to complete the upgrade. However, there is no way in Elasticsearch to unset this particular cluster setting (I was surprised). We spent a good amount of time on it, but nothing worked. At that point I could neither upgrade the cluster nor roll back (at least per the documentation). I had to do something, since this was our production cluster, so I ended up trying the steps above (the only logical thing I could think of at the time).

I suspect that could have been avoided if you had attempted and validated the upgrade in a test cluster first rather than taking a leap of faith directly in production?

For the benefit of other readers, the process you've described is super-risky and very much not recommended. It's possible you have lost some data there and just haven't noticed yet. I think you also skipped a number of steps in the upgrade instructions to get into that state. The setting you mention, script.max_compilations_per_min, was properly deprecated in 5.6 and would have resulted in deprecation warnings prior to your upgrade.
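For what it's worth, 6.x replaced that setting with a rate-based one, script.max_compilations_rate, so after the upgrade something like the following should be accepted (the value shown is just the 6.x default):

PUT /_cluster/settings
{
  "persistent": {
    "script.max_compilations_rate": "75/5m"
  }
}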

Finally, I think there are ways to get out of that state without a risky rollback, e.g. unsetting all cluster settings with:

PUT /_cluster/settings
{
  "persistent": {
    "*": null
  },
  "transient": {
    "*": null
  }
}

@Christian_Dahlqvist Yeah, that would have been ideal. We obviously tested things in test environment and things looked alright. We have over 30 clusters and there was one cluster where we had to update this setting (a long time back). Hence, we couldn't catch this in test.

@DavidTurner
We tried setting the persistent and transient settings to null, but it didn't work. Someone filed a GitHub issue for it, and I believe the Elasticsearch team decided to close it.
