Elastic Stack OS Migration

Hello,
We have a requirement where we need to migrate our ES cluster from CentOS 7 to Rocky 9.

Cluster Details:

  • CentOS 7.9 Nodes
  • 15 nodes (3 master nodes, 12 data nodes)
  • Stack version 8.12.2; we might upgrade to 8.14.3 before migrating to Rocky 9

Is there a recommended way to achieve this without losing data?

I thought the following steps might get the job done.

  1. Run elasticsearch-node detach-cluster, which removes the node from the cluster.
  2. Upgrade the detached node's OS to Rocky 9 via a clean install or migration tools.
  3. Install Elasticsearch on the new server.
  4. Rejoin the newly upgraded server to the cluster as a new node.

Is this the way to go? And if so, how should I handle the master nodes? Should I upgrade 3 data nodes first and make them master-eligible nodes, followed by the rest of the data nodes and then the old master nodes? Or should I upgrade one of the existing master-eligible nodes to Rocky 9 earlier than the others, allowing it to take control until I get the other 2 masters upgraded as well?

Any suggestion would be appreciated.

To avoid losing data, the recommendation is to always have snapshots, so the first thing you should do is check whether you have snapshots of the data you cannot afford to lose.
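
If you are not sure what you have, you can check the registered snapshot repositories and their snapshots through the REST API. Here is a minimal sketch in Python with the requests library; the endpoint, credentials and CA path are placeholders for whatever your cluster uses:

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA = "/etc/elasticsearch/certs/http_ca.crt"      # placeholder CA certificate

# List the registered snapshot repositories (an empty result means you have none).
repos = requests.get(f"{ES}/_snapshot/_all", auth=AUTH, verify=CA).json()
print("repositories:", list(repos))

# List the snapshots in each repository with their state (SUCCESS, PARTIAL, FAILED...).
for repo in repos:
    r = requests.get(f"{ES}/_cat/snapshots/{repo}?v&h=id,status,start_time,end_time",
                     auth=AUTH, verify=CA)
    print(r.text)
```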

I would not use the elasticsearch-node tool unless required; this tool is used to perform unsafe operations that can lead to data loss, as mentioned at the beginning of its documentation.

What you want can be done in a safer way; it just may take some time to finish.

Do you have spare resources to spin up some extra nodes? It is not clear whether you are using VMs or bare metal, nor whether you have spare resources.

If you want to upgrade Elasticsearch before changing the OS, you need to follow the rolling upgrade documentation: upgrade the data nodes first and then the master nodes; the master nodes are always the last ones that should be upgraded.

Please provide more context about your cluster, including whether or not you have spare resources to spin up new nodes, so a better suggestion can be made.

Our current cluster is bare metal (15 physical servers) and we can't add any more resources at the moment due to budget issues.

If you have any recommendation on how to do the OS migration in-place and without downtime that would be greatly appreciated.

If you do an in-place OS upgrade, then that normally looks just like a rolling restart to Elasticsearch (perhaps with a longer time between shutdown and restart), provided:

  • You upgrade the OS in place without removing packages, mounted disks, etc.
  • The downtime on each node is minutes to hours rather than days.
  • You aren't trying to change Elasticsearch versions or config at the same time.

If those things are true, then you can consider doing this as a rolling restart.
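
If you go down that route, the main discipline is one node at a time: after a node comes back from its OS upgrade, wait for the cluster to return to green before touching the next one. A small sketch of that check with Python and the requests library (connection details are placeholders):

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA = "/etc/elasticsearch/certs/http_ca.crt"      # placeholder CA certificate

# wait_for_status makes Elasticsearch hold the request (up to the timeout)
# until the cluster reaches green, so we simply retry until it gets there.
while True:
    health = requests.get(f"{ES}/_cluster/health?wait_for_status=green&timeout=60s",
                          auth=AUTH, verify=CA).json()
    if health.get("status") == "green" and not health.get("timed_out", True):
        break

print("cluster is green, safe to move on to the next node")
```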

If you want to avoid downtime then you'll need to either:

  1. Make sure you have replicas on all shards (so that the node you shut down isn't removing the only copy of the shard data)
  2. Use shard allocation filtering to force shards off the node before you shut it down

Option 1 is definitely the preferred option.
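
For option 1, a quick way to see where you stand is to list any indices that have zero replicas, since those are the ones whose only shard copy would go offline with the node. A rough sketch with Python and the requests library (connection details are placeholders):

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA = "/etc/elasticsearch/certs/http_ca.crt"      # placeholder CA certificate

# "rep" is the configured number of replicas per index.
indices = requests.get(f"{ES}/_cat/indices?format=json&h=index,rep",
                       auth=AUTH, verify=CA).json()
no_replica = [i["index"] for i in indices if int(i["rep"]) == 0]
print("indices without replicas:", no_replica)

# A replica can be added per index if the disk space allows it, e.g.:
# requests.put(f"{ES}/my-index/_settings", auth=AUTH, verify=CA,
#              json={"index": {"number_of_replicas": 1}})
```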

You'll also need to decide whether to disable shard allocation during the upgrade. Typically for a rolling upgrade we recommend that you disable allocation so that the cluster doesn't try and move shards around - if the restarted node is going to rejoin soon then it's almost always faster and cheaper to just let it bring the shards back online than to reallocate them on a different node. But, disabling shard allocation means that you will have 1 unassigned replica, which reduces data resilience. If your OS upgrade is going to take hours, then you might decide to keep shard allocation on so that your cluster maintains the intended number of shard copies.
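
If you do decide to pause replica allocation around each restart, it is the cluster.routing.allocation.enable setting: set it to primaries before stopping the node and back to null once the node has rejoined. A small sketch (connection details are placeholders):

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA = "/etc/elasticsearch/certs/http_ca.crt"      # placeholder CA certificate

def set_allocation(value):
    # "primaries" pauses replica allocation; None (JSON null) restores the default.
    requests.put(f"{ES}/_cluster/settings", auth=AUTH, verify=CA,
                 json={"persistent": {"cluster.routing.allocation.enable": value}}
                 ).raise_for_status()

set_allocation("primaries")   # before stopping Elasticsearch on the node
# ... stop the service, upgrade the OS in place, start the service ...
set_allocation(None)          # after the node has rejoined the cluster
```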

However, if you decide to do a clean install of the new OS (including reinstalling Elasticsearch) then that's no longer a rolling restart - the new node will have a new identity. You have to treat each step of that upgrade as a "remove node from cluster" and "add node to cluster".
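
You can see this identity change for yourself by looking at the node IDs before and after: a node whose data path has been wiped comes back with a new ID. For example (connection details are placeholders):

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA = "/etc/elasticsearch/certs/http_ca.crt"      # placeholder CA certificate

# The "id" column is the persistent node ID Elasticsearch assigned to each node.
print(requests.get(f"{ES}/_cat/nodes?v&h=id,name,ip,node.role",
                   auth=AUTH, verify=CA).text)
```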

The main issue here is that I do not know if there is a supported in-place migration from CentOS 7 to Rocky 9.

You would not only be changing distros, no matter how similar they are, but also jumping 2 major versions; a lot has changed, and a lot of things can break or require manual intervention to fix.

I would do a clean install in this case.

With a clean install you would need to do as Tim mentioned and treat each step as removing a node and adding a node. This can be done without downtime; it will just take some time to finish the entire process, depending on how much data you have in your cluster.

For each data node you would need to do the following steps (there is a rough sketch after this list):

  • Exclude the node from allocation, as mentioned in this example in the documentation.
  • Wait until the node is empty.
  • Once the node is empty, stop the Elasticsearch service, perform a clean install of Rocky 9, install and configure Elasticsearch to join your cluster, and start it.
  • Allow allocation to the node again.
  • Wait for the cluster to rebalance the shards and then repeat the steps on the next node.
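
Roughly, the drain loop for one data node could look like this in Python against the REST API; the node name and connection details are placeholders:

```python
import time
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA = "/etc/elasticsearch/certs/http_ca.crt"      # placeholder CA certificate
NODE = "data-node-01"                            # placeholder node name

def exclude(node_name):
    # node_name = None (JSON null) clears the exclusion again.
    requests.put(f"{ES}/_cluster/settings", auth=AUTH, verify=CA,
                 json={"persistent": {
                     "cluster.routing.allocation.exclude._name": node_name}}
                 ).raise_for_status()

# 1. Exclude the node from allocation so its shards get moved elsewhere.
exclude(NODE)

# 2. Wait until the node holds no shards.
while True:
    rows = requests.get(f"{ES}/_cat/allocation?format=json&h=node,shards",
                        auth=AUTH, verify=CA).json()
    remaining = next((int(r["shards"]) for r in rows if r["node"] == NODE), 0)
    print(f"{remaining} shards still on {NODE}")
    if remaining == 0:
        break
    time.sleep(60)

# 3. Stop the service, reinstall the OS and Elasticsearch, rejoin the cluster...

# 4. Clear the exclusion so shards can be allocated to the rebuilt node again.
exclude(None)
```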

Are your master nodes dedicated masters? If they are, you do not need to remove them from allocation, as they do not hold data besides the internal metadata about the cluster and nodes; you just stop the node, perform a clean install, install Elasticsearch, and add it to the cluster again.

One issue here is that with only two masters you are at risk of one of them failing and your cluster going down. To help avoid this, you could make a couple of your data nodes master-eligible during this process, and after you have your 3 masters running on Rocky 9 you could then remove the master role from those data nodes (this requires a restart).

The big problem is that all those steps can take a long time to complete. Also, to empty one node you need to have enough space on your other nodes for the data that gets moved off it.
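
A quick way to check whether the remaining nodes have the headroom is _cat/allocation, which shows shard count and disk usage per node (connection details are placeholders):

```python
import requests

ES = "https://localhost:9200"                    # placeholder endpoint
AUTH = ("elastic", "changeme")                   # placeholder credentials
CA = "/etc/elasticsearch/certs/http_ca.crt"      # placeholder CA certificate

# Shard count, disk used by indices, free disk and usage percent for every node.
print(requests.get(
    f"{ES}/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent",
    auth=AUTH, verify=CA).text)
```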