I’m struggling to efficiently and repeatedly back up and restore indices between two independent ES clusters, as each restore is causing a lot of shard rebalancing.
The setup is as follows:
an ES 8.19 cluster composed of 11 nodes, located in DC1
an up-to-date MinIO cluster, located in DC1
an ES 8.19 cluster composed of 3 nodes, located in DC2
the DC1 cluster has the MinIO cluster configured as its snapshot repository, in read/write mode
the DC2 cluster has the same MinIO cluster configured as its snapshot repository, in read-only mode
After an initial full sync of the data between those clusters, I try to keep them as synced as possible (DC1 => DC2), on as regular a schedule as possible, to get a clear RPO.
That is really not the case right now, given the unpredictable shard rebalancing each restore operation seems to cause.
To do that, an Ansible playbook is scheduled to run every 30 minutes:
if both clusters are green and DC2 has no “undesired shards” (GET /_cat/shards?v) about to be relocated:
at DC1, take a snapshot of the next 1 to 13 indices
(8 to 49 shards total,
almost exclusively 4 primary shards per index,
and a total index size (primary + replica) of 140 GB to 1 TB)
at DC2, for each index we’ve just snapshotted: close it, then restore it
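To make the “undesired shards” pre-check concrete, here is a minimal sketch of how the playbook’s gate could be implemented; this is illustrative, not my actual playbook. It only shows the parsing of the plain-text `GET /_cat/shards?v` response (fetching it from the cluster is left out), counting shards in any state other than STARTED:

```python
def count_unstarted_shards(cat_shards_output: str) -> int:
    """Count shards whose state column is anything other than STARTED
    (i.e. RELOCATING/INITIALIZING/UNASSIGNED shards that should block
    the next restore)."""
    count = 0
    for line in cat_shards_output.strip().splitlines():
        cols = line.split()
        # _cat/shards?v columns start with: index shard prirep state ...
        if len(cols) >= 4 and cols[0] != "index" and cols[3] != "STARTED":
            count += 1
    return count

sample = """index shard prirep state      docs store ip        node
logs-1 0     p      STARTED    1000 1gb   10.0.0.1 node-1
logs-1 0     r      RELOCATING 1000 1gb   10.0.0.2 node-2
logs-2 0     p      STARTED    2000 2gb   10.0.0.3 node-3"""

print(count_unstarted_shards(sample))  # → 1
```

The playbook would only proceed with the next snapshot/restore cycle when this count is zero (or below whatever percentage threshold is configured).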
Extra info about the clusters :
240 indices for 70 TB of disk usage
720 primary shards
The issue is that the DC2 cluster sometimes spends >12 hours rebalancing dozens of shards (sometimes 15 to 30 per node) right after restoring up to 13 indices in under an hour (so not a huge diff of data copied).
How is it possible that restoring even a few gigabytes, or even a few hundred megabytes, into existing shards causes so much churn in the cluster?
Is that normal / something some of you have experienced?
Any clues on how to get a more reliable sync schedule and a clear RPO?
I assume you don't want to pay for the license to use CCR?
Can you confirm number_of_replicas on DC1 and DC2 (e.g. do you restore with 0 replicas, then increase it afterwards)? And has this issue always been present, and if not, when did it start?
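To be clear about the pattern I mean: the restore API lets you override index settings at restore time, so you can restore with 0 replicas and raise the count once the primaries are in place. Something along these lines (repository, snapshot and index names are placeholders):

```
POST /_snapshot/my_repo/my_snapshot/_restore
{
  "indices": "my-index",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}

PUT /my-index/_settings
{
  "index.number_of_replicas": 1
}
```

Restoring replicas from the repository and building them via peer recovery have different costs, so which variant you use can change how much data moves per cycle.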
I ask because somewhere in the 8.x release cycle the "balancing" algorithm was changed, and this led to more shard movements to "balance" clusters that at first glance appeared to be already balanced. There have been a couple of threads about this here since I've been active on the forum. Here's one.
My recollection is that though you can try to tune a cluster to be less prone to shard movements, it's a bit of a dark art, and the algorithm seems a bit opaque (well, the code is open source, so that's not quite fair, but I don't think the "controls" are the most clearly documented, IMO).
Did you try doing the restores exactly one index at a time, and if so, does it lead to the same shard-shuffling behavior?
I hope one of the other regular posters is able to provide ideas/suggestions here, as I could see how frustrating the reported behaviour must be.
By the way, a bit of a wild idea if nothing else pans out: if your DC2 cluster had just one (beefy) node, then no shards would ever need to move. OK, you'd have no replicas, but other nodes could be added to the cluster as part of a recovery process. You could even run 2x single-node DC2 clusters.
Yes, your assumption is right: paying the license to use CCR is not on the table right now.
I’ll go down this trail of “maybe a minor 8.x update changed my cluster’s rebalancing behavior overnight”, as there seemed to be no issues for months while this was running.
But to be fair, I can’t tell when it started to “go bad”; I just didn’t look at the unbalanced % at the start. I only happened to see warning logs that it’s been >10% (up to 22%), so now I care about it.
So I decided to explore various tweaks to my backup-restore algorithm and schedule, but with no good results.
Good point made here by @DavidTurner. One could say that rebalancing affects my RPO because I chose to let it: when I started tuning my algorithm, I made it wait for the “undesired shards” to be <10%, 5%, or 0% before restoring the next index.
If I don’t, it seems I can never reach a stable state, as shards can’t move fast enough to catch up.
I already restore exactly one index at a time, and it has always been a loop of “close index, restore it, close another index, restore it…”
Even though it’s not possible in my case, the “one beefy node” approach is interesting, thanks for mentioning it.
Almost certainly the fix for this is to tune the balancer threshold to be higher.
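For reference, the balancer threshold is the dynamic cluster setting `cluster.routing.allocation.balance.threshold`; raising it makes the balancer tolerate more imbalance before moving shards. The value below is only an illustration, not a recommendation for this specific cluster:

```
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 5.0
  }
}
```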
Going one-by-one through indices makes life much harder for the balancer: each restore moves the goalposts. Elasticsearch expects you to just carry on regardless; there’s no need to consider shard movements when restoring snapshots. I’m not really surprised it takes ages if you wait for the movements to stop between each index.
After doing something major to the cluster like a snapshot restore, it might make sense to execute DELETE /_internal/desired_balance. This happens automatically when something “significant” happens in the cluster already (e.g. a node leaving), but snapshot restores aren’t considered significant in this sense. Still, if you’re restoring a lot of indices at once, there might be some benefit to it.
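Putting that together with the restore loop, the end of a sync run could look something like this (repository, snapshot and index names are placeholders):

```
POST /my-index/_close
POST /_snapshot/my_repo/my_snapshot/_restore
{
  "indices": "my-index"
}

# ...repeat for the other indices in the batch, then reset the
# desired balance once, rather than waiting between each index:
DELETE /_internal/desired_balance
```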