How do I migrate Elasticsearch Data Nodes by Moving Virtual Disks Between VMs

Hi all,

I have a production Elasticsearch v7.10.2 cluster with 4 data nodes and 3 master nodes. Most of our indices have 4 shards and 1 replica. We perform a rollup of indices nightly at 3 AM.

I’m planning to migrate each data node to a new OS by deploying new VMs one at a time. The data disks are virtual disks in our hypervisor environment, and my plan is to detach the virtual disk from the old VM and attach it to the new VM hosting the new OS.

I have a tiny local cluster that I've been using to test the behavior, but I'm still a bit scared. Is there anything that could go wrong? Is there a better way to do this?

Just to be clear, will the new VM be given the same node name/IP as the node it replaces?

You must ensure the virtual disk is never mounted on two VMs at the same time.

There are a few different ways you might approach this; which one to prefer is mostly a matter of sysadmin taste. Testing on a small non-production cluster first is the right approach.

@RainTown
Thanks for looking into my issue!
I’m aware of the mounting concern and have some experience migrating disks between VMs.
My question is: what issues could arise if one of the four nodes disconnects and then comes back online after 1–5 minutes? Will the indices be affected in any way?

Well, cluster health will go yellow, shards will change state (replicas promoted to primaries), shards may move around (preventable), and clients might see transient errors. All of these, and others, are manageable. Whether those count as “issues” is semantics.

Whatever node you are working on will (likely) contain some indices’ primary shards, and replica shards for other indices. Those shards will be unavailable while the node is down.
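If you want to watch what actually happens while the node is away, the cluster health and cat shards APIs show the yellow status and the unassigned shards. A minimal sketch in Python, assuming the cluster answers on http://localhost:9200 without security enabled (adjust host/credentials for your setup):

```python
# Watch cluster health and unassigned shards while a data node is being swapped.
# Assumes the cluster is reachable on http://localhost:9200 without auth.
import requests

ES = "http://localhost:9200"

# Overall health: status (green/yellow/red) and the number of unassigned shards.
health = requests.get(f"{ES}/_cluster/health", timeout=10).json()
print(health["status"], health["unassigned_shards"])

# Per-shard view: which shards are UNASSIGNED and why.
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"v": "true", "h": "index,shard,prirep,state,unassigned.reason,node"},
    timeout=10,
)
print(shards.text)
```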

You didn’t answer the question: will the new VM keep the same node name/IP?

If yes, the experience should be similar to when you have applied OS patches and rebooted, or upgraded the cluster software version.

Yes, the machine will get the same hostname and IP address. I was planning to do the work before the new indices are created, so that the new indices start being written into a cluster that has all of its nodes.
How can I reduce the time the indices stay in yellow status and the time shards are unavailable?

You probably want to disable, or tune, shard allocation for short periods.
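For example (a minimal sketch, assuming http://localhost:9200 with no auth; the values are illustrative): you can either hold back replica allocation around the restart, or raise the delayed-allocation timeout so the cluster waits out a short outage before rebuilding replicas elsewhere.

```python
# Two knobs for keeping shards from being rebuilt during a short, planned outage.
# Assumes http://localhost:9200 without auth; timeout values are illustrative.
import requests

ES = "http://localhost:9200"

# Option A: only allow primaries to be allocated while the node is down,
# then set the value back to null afterwards to restore the default ("all").
requests.put(f"{ES}/_cluster/settings", timeout=10, json={
    "persistent": {"cluster.routing.allocation.enable": "primaries"}
}).raise_for_status()

# Option B: raise how long the cluster waits before rebuilding replicas of a
# departed node (index.unassigned.node_left.delayed_timeout, default 1m) so it
# comfortably covers the expected 1-5 minute outage.
requests.put(f"{ES}/_all/_settings", timeout=10, json={
    "settings": {"index.unassigned.node_left.delayed_timeout": "10m"}
}).raise_for_status()
```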

I’d likely NOT do this operation near the time when new indices would be created by an application or ILM policy or …

Again, if your prep is good this isn’t a difficult task. You are cautious (good!) and asking the right questions. You have a test cluster to tune the process on.

The risks are mostly on the sysadmin-error side: wrong permissions, mismatched UIDs, duplicate IPs, etc. If you avoid these, it’s not much different from a node rebooting.
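A hypothetical pre-flight check you could run on the new VM before starting Elasticsearch; the data path and service user name here are assumptions, so substitute your own:

```python
# Hypothetical pre-flight check: verify the migrated data path is mounted and that
# its ownership matches the local "elasticsearch" user (UIDs can differ between
# OS images). Path and user name are assumptions -- use your own values.
import os
import pwd
import grp

DATA_PATH = "/var/lib/elasticsearch"   # assumed path.data
ES_USER = "elasticsearch"              # assumed service user

st = os.stat(DATA_PATH)
try:
    owner = pwd.getpwuid(st.st_uid).pw_name
except KeyError:
    owner = f"unknown uid {st.st_uid}"   # the old VM's uid has no matching user here
try:
    group = grp.getgrgid(st.st_gid).gr_name
except KeyError:
    group = f"unknown gid {st.st_gid}"

print(f"{DATA_PATH}: mounted={os.path.ismount(DATA_PATH)} owned by {owner}:{group}")

expected = pwd.getpwnam(ES_USER)
if st.st_uid != expected.pw_uid:
    raise SystemExit(f"UID mismatch: data dir owner is uid {st.st_uid}, "
                     f"but {ES_USER} is uid {expected.pw_uid} on this VM")
```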

This doesn't really matter much. If you attach the disk to a new node with a different name or IP then Elasticsearch will work out what to do.

Follow the instructions in the reference manual for the rolling restart process.
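Roughly, the documented order per node is: disable replica allocation, flush, stop the node, do the maintenance (here: move the virtual disk), start the node, re-enable allocation, and wait for green before touching the next node. A rough outline in Python, where the host, node names and the disk move itself are placeholders:

```python
# Rough per-node outline of the rolling-restart order from the reference manual.
# Host, node names and the disk-migration step are placeholders/assumptions.
import requests

ES = "http://localhost:9200"

def allocation(value):
    # "primaries" holds back replica allocation; None restores the default ("all").
    requests.put(f"{ES}/_cluster/settings",
                 json={"persistent": {"cluster.routing.allocation.enable": value}},
                 timeout=10).raise_for_status()

def wait_for(status):
    # Blocks until the cluster reaches the given health status (or the wait times out).
    requests.get(f"{ES}/_cluster/health",
                 params={"wait_for_status": status, "timeout": "30m"},
                 timeout=1900).raise_for_status()

for node in ["data-1", "data-2", "data-3", "data-4"]:   # placeholder node names
    allocation("primaries")                    # 1. hold back replica re-allocation
    requests.post(f"{ES}/_flush", timeout=60)  # 2. flush to speed up shard recovery
    # 3. stop Elasticsearch on `node`, detach/attach the virtual disk, boot the new VM
    # 4. start Elasticsearch there and wait for the node to rejoin the cluster
    allocation(None)                           # 5. re-enable replica allocation
    wait_for("green")                          # 6. wait for green before the next node
```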

Again, this shouldn't really matter.

Did you make some progress? Do you have further questions?