Is there a way to determine what triggers shard movement?

Our cluster all of a sudden started to rebalance for no apparent reason, and it's been going on for a day.
I had tweaked the setting "cluster.routing.allocation.balance.threshold" to 10, which, as I understand it, allows a difference of 10 shards between nodes.
It's a cluster of around 90 data nodes, and the cluster looks to be pretty balanced.

It has been running fine without issue for years. How would I debug this?

Any pointers would be highly appreciated.

We did update Ubuntu recently, but it should have nothing to do with this. The shard movement doesn't line up with the OS upgrade either; it started several days (weeks?) afterwards.

What version are you running? This is important information, as it may be an issue related to a specific version.

I would recommend that you revert this change, as it may be making your issue worse.

Also, can you share the result of running the following request in Kibana Dev Tools?

GET /_cat/nodes?v&h=name,role,disk.used_percent,disk.used,disk.avail&s=role

version 8.10.2

I just reset the threshold to null. We'll see if that helps.
But that value was recommended by the online documentation to give larger clusters some slack so they don't rebalance too frequently.
Plus, that value was changed a long time ago.

name                               role disk.used_percent disk.used disk.avail
es8-data-157-production.xyz.pvt   d                45.58     2.4tb      2.9tb
es8-data-111-production.xyz.pvt   d                45.34     2.4tb      2.9tb
es8-data-142-production.xyz.pvt   d                41.16     2.2tb      3.1tb
es8-data-152-production.xyz.pvt   d                43.85     2.3tb        3tb
es8-data-129-production.xyz.pvt   d                41.31     2.2tb      3.1tb
es8-data-110-production.xyz.pvt   d                44.25     2.3tb      2.9tb
es8-data-126-production.xyz.pvt   d                45.57     2.4tb      2.9tb
es8-data-122-production.xyz.pvt   d                43.33     2.3tb        3tb
es8-data-141-production.xyz.pvt   d                41.53     2.2tb      3.1tb
es8-data-162-production.xyz.pvt   d                43.91     2.3tb        3tb
es8-data-105-production.xyz.pvt   d                45.50     2.4tb      2.9tb
es8-data-106-production.xyz.pvt   d                43.23     2.3tb        3tb
es8-data-133-production.xyz.pvt   d                42.98     2.3tb        3tb
es8-data-156-production.xyz.pvt   d                45.09     2.4tb      2.9tb
es8-data-145-production.xyz.pvt   d                42.47     2.2tb        3tb
es8-data-168-production.xyz.pvt   d                43.59     2.3tb        3tb
es8-data-100-production.xyz.pvt   d                41.37     2.2tb      3.1tb
es8-data-149-production.xyz.pvt   d                46.00     2.4tb      2.8tb
es8-data-109-production.xyz.pvt   d                44.00     2.3tb        3tb
es8-data-150-production.xyz.pvt   d                45.47     2.4tb      2.9tb
es8-data-155-production.xyz.pvt   d                46.31     2.4tb      2.8tb
es8-data-146-production.xyz.pvt   d                43.44     2.3tb        3tb
es8-data-173-production.xyz.pvt   d                46.81     2.5tb      2.8tb
es8-data-140-production.xyz.pvt   d                43.47     2.3tb        3tb
es8-data-177-production.xyz.pvt   d                40.50     2.1tb      3.1tb
es8-data-176-production.xyz.pvt   d                41.08     2.2tb      3.1tb
es8-data-134-production.xyz.pvt   d                44.06     2.3tb        3tb
es8-data-137-production.xyz.pvt   d                47.67     2.5tb      2.8tb
es8-data-103-production.xyz.pvt   d                45.05     2.4tb      2.9tb
es8-data-117-production.xyz.pvt   d                45.60     2.4tb      2.9tb
es8-data-174-production.xyz.pvt   d                46.08     2.4tb      2.8tb
es8-data-115-production.xyz.pvt   d                43.82     2.3tb        3tb
es8-data-172-production.xyz.pvt   d                46.88     2.5tb      2.8tb
es8-data-121-production.xyz.pvt   d                44.75     2.4tb      2.9tb
es8-data-175-production.xyz.pvt   d                40.44     2.1tb      3.1tb
es8-data-148-production.xyz.pvt   d                43.91     2.3tb        3tb
es8-data-161-production.xyz.pvt   d                47.89     2.5tb      2.7tb
es8-data-104-production.xyz.pvt   d                44.04     2.3tb        3tb
es8-data-116-production.xyz.pvt   d                42.89     2.3tb        3tb
es8-data-147-production.xyz.pvt   d                44.04     2.3tb        3tb
es8-data-130-production.xyz.pvt   d                45.58     2.4tb      2.9tb
es8-data-125-production.xyz.pvt   d                42.75     2.2tb        3tb
es8-data-118-production.xyz.pvt   d                42.21     2.2tb      3.1tb
es8-data-144-production.xyz.pvt   d                42.93     2.3tb        3tb
es8-data-166-production.xyz.pvt   d                45.07     2.4tb      2.9tb
es8-data-139-production.xyz.pvt   d                42.37     2.2tb        3tb
es8-data-127-production.xyz.pvt   d                43.28     2.3tb        3tb
es8-data-113-production.xyz.pvt   d                46.19     2.4tb      2.8tb
es8-data-165-production.xyz.pvt   d                42.31     2.2tb        3tb
es8-data-171-production.xyz.pvt   d                42.56     2.2tb        3tb
es8-data-158-production.xyz.pvt   d                44.49     2.3tb      2.9tb
es8-data-167-production.xyz.pvt   d                44.75     2.4tb      2.9tb
es8-data-163-production.xyz.pvt   d                43.14     2.3tb        3tb
es8-data-131-production.xyz.pvt   d                42.97     2.3tb        3tb
es8-data-108-production.xyz.pvt   d                44.27     2.3tb      2.9tb
es8-data-120-production.xyz.pvt   d                44.62     2.3tb      2.9tb
es8-data-124-production.xyz.pvt   d                45.17     2.4tb      2.9tb
es8-data-159-production.xyz.pvt   d                44.49     2.3tb      2.9tb
es8-data-101-production.xyz.pvt   d                43.68     2.3tb        3tb
es8-data-114-production.xyz.pvt   d                45.38     2.4tb      2.9tb
es8-data-112-production.xyz.pvt   d                43.80     2.3tb        3tb
es8-data-135-production.xyz.pvt   d                43.98     2.3tb        3tb
es8-data-128-production.xyz.pvt   d                42.36     2.2tb        3tb
es8-data-151-production.xyz.pvt   d                44.19     2.3tb      2.9tb
es8-data-119-production.xyz.pvt   d                42.49     2.2tb        3tb
es8-data-136-production.xyz.pvt   d                45.57     2.4tb      2.9tb
es8-data-153-production.xyz.pvt   d                46.78     2.5tb      2.8tb
es8-data-143-production.xyz.pvt   d                44.50     2.3tb      2.9tb
es8-data-138-production.xyz.pvt   d                42.27     2.2tb        3tb
es8-data-160-production.xyz.pvt   d                45.64     2.4tb      2.9tb
es8-data-170-production.xyz.pvt   d                44.48     2.3tb      2.9tb
es8-data-169-production.xyz.pvt   d                44.15     2.3tb      2.9tb
es8-data-132-production.xyz.pvt   d                42.47     2.2tb        3tb
es8-data-107-production.xyz.pvt   d                44.77     2.4tb      2.9tb
es8-data-154-production.xyz.pvt   d                41.93     2.2tb      3.1tb
es8-data-164-production.xyz.pvt   d                45.17     2.4tb      2.9tb
es8-data-123-production.xyz.pvt   d                44.87     2.4tb      2.9tb
es8-data-102-production.xyz.pvt   d                46.47     2.4tb      2.8tb
es8-access-001-production.xyz.pvt im                0.95   290.4mb     29.7gb
es8-access-000-production.xyz.pvt im                0.83   254.8mb     29.7gb
es8-access-002-production.xyz.pvt im                0.82   253.2mb     29.7gb

Yeah, you may be right.

Your cluster seems to be pretty balanced; are you still having issues?

I think it may still be related to this issue.

Is this the same issue you had in the past in this topic?

Maybe the values you used are not enough anymore.

Still having the issue.
The other issue I raised was a behavior change from version 7 to 8, which was resolved by tweaking that threshold value.
That was clearly an algorithm change, so I just needed to be aware of the difference.

But this ongoing issue that started yesterday is something new to me.
It seems the cluster is stuck in some sort of permanent rebalancing mode.
I just need to find out how to get the cluster out of that mode.
Do I need to restart the entire cluster, etc.?
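
As a side note, one way to keep an eye on how much movement is actually in flight (a sketch using the standard cluster health and cat shards APIs; sorting by state groups the RELOCATING shards together):

GET _cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards

GET _cat/shards?v&h=index,shard,prirep,state,node&s=state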

We've been running this version for almost 2 years. Something changed yesterday, and I'm trying to find out what.
Hopefully it's a known issue/behavior...

Does this look right?

I have 999+ shards with "node_is_desired": false. Would this be the reason for the perpetual rebalancing?
This is from the command
GET _internal/desired_balance

Here are the cluster balance stats:

  "cluster_balance_stats": {
    "tiers": {
      "data": {
        "shard_count": {
          "total": 13479,
          "min": 155,
          "max": 192,
          "average": 172.80769230769232,
          "std_dev": 9.287512559687055
        },
        "forecast_write_load": {
          "total": 0,
          "min": 0,
          "max": 0,
          "average": 0,
          "std_dev": 0
        },
        "forecast_disk_usage": {
          "total": 212348761157460,
          "min": 2460557075048,
          "max": 2947363406633,
          "average": 2722420014839.231,
          "std_dev": 109042303554.36969
        },
        "actual_disk_usage": {
          "total": 212348761182957,
          "min": 2460557075048,
          "max": 2947363406633,
          "average": 2722420015166.115,
          "std_dev": 109042300504.57008
        }
      }
    },

Is there any info I can derive from this?

Does the "std_dev" in disk_usage look right? It seems kind of huge, doesn't it?
We recently added and retired a bunch of nodes; we basically rotate out old nodes and their EBS volumes.
Could this be the culprit that screws up the rebalancing calculation?

I checked another, smaller cluster in a different environment. It also has a large "std_dev" value, but without the perpetual rebalancing.
There, "node_is_desired" is true for all shards.

How do I figure out the cause of "node_is_desired": false?
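
One way to dig into an individual shard's placement decision (a sketch; "my-index" and the shard number are placeholders) is the allocation explain API:

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}

For an assigned shard it should report which node it is on, whether it can remain there, and the rebalance decision for moving it elsewhere.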

On a hunch, I decided to change the settings as follows.

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 60,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 200,
    "cluster.routing.allocation.node_concurrent_recoveries": 4,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null
  }
}

I was hoping it would speed up the rebalancing.
After about 10 hours of work, it finally finished.
I'm going to leave the threshold at 30 to allow a larger shard delta between nodes, and set the other two back to null.
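
For reference, this is roughly what I'm leaving in place (a sketch based on the above; the nulls reset those settings to their defaults):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.threshold": 30,
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_recoveries": null
  }
}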

Not sure what triggered the massive rebalance initially, but leaving the settings at their defaults definitely doesn't work: the cluster had been rebalancing for several days, moving 2 shards at a time. Our cluster is also in production, so some indices are being created and deleted dynamically. The default settings won't cut it once the cluster is in the state I experienced.

This was an interesting thread. Thanks for sharing @linkerc

The "best" settings in a case like yours seem hard to find, and people will likely just say "it depends".

e.g. in this thread on GitHub there is:

Also in larger clusters it is really important to increase cluster.routing.allocation.balance.threshold

and the docs have this:

If you have a large cluster, it may be unnecessary to keep it in a perfectly balanced state at all times. It is less resource-intensive for the cluster to operate in a somewhat unbalanced state rather than to perform all the shard movements needed to achieve the perfect balance. If so, increase the value of cluster.routing.allocation.balance.threshold to define the acceptable imbalance between nodes. For instance, if you have an average of 500 shards per node and can accept a difference of 5% (25 typical shards) between nodes, set cluster.routing.allocation.balance.threshold to 25.

You have around 172 shards per data node, so even allowing a 10% difference between nodes would suggest setting that threshold to ca. 17.

You shared:

        "shard_count": {
          "total": 13479,
          "min": 155,
          "max": 192,
          "average": 172.80769230769232,
          "std_dev": 9.287512559687055
        }

I'm curious whether all the shard movements really changed those statistics much.

At that time it maybe looks a bit unbalanced in shard count, as 155 and 192 are both close to 2 sigma away from the mean. However, if the shard counts were normally distributed (they're not, but ...) around that mean with that std_dev, the min-max range across that many nodes would almost always be larger than your observed 37 (roughly 4 sigma).
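
For the record, the rough arithmetic behind "close to 2 sigma" and the expected spread (assuming the 78 data nodes shown in the _cat/nodes output above):

(192 - 172.8) / 9.29 ≈ 2.07 sigma above the mean
(172.8 - 155) / 9.29 ≈ 1.92 sigma below the mean

For ~78 samples from a normal distribution, the expected min-max range is roughly 4.8 sigma ≈ 45 shards, so the observed spread of 37 is, if anything, on the small side.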

You may wish to follow this issue, though it's not seeing much action.

Hmmm.

It happened again at the start of September.

I think there’s an algorithm issue with 8.10.2.

It feels like there’s pent-up rebalancing, and something causes it to explode.

We do create and delete indices hourly, daily, and monthly, automatically. The default of 2 concurrent shard rebalances simply can’t keep up with such a sudden, large reassignment.

I seriously doubt the need for such a massive reassignment of shards. Is it possible that this is the result of multiple iterations of rebalance calculations, where subsequent calculations are based on an incomplete rebalancing state?