Cluster shards unbalanced and keep moving shards around after upgrade to 8.8.1

Hello,

Yesterday we upgraded our cluster from 8.5.1 to 8.8.1 and now the shards are unbalanced between the nodes, and the cluster keeps moving shards around trying to balance them.

I have a hot/warm architecture with 4 hot nodes and 12 warm nodes and only ~50 TB of data in total, but even after 12 hours it is still moving shards around.

This is what some of the warm nodes look like:

All warm nodes have the same specs.

One big change in this upgrade is that, starting in 8.6, shard size is taken into account in the way Elasticsearch balances shards. Could my issue be related to this?

I just found this GitHub issue and I think it is related to my problem.

I have custom values for 3 of the 4 settings mentioned in the workaround section of that issue:

cluster.routing.allocation.cluster_concurrent_rebalance
cluster.routing.allocation.node_concurrent_incoming_recoveries
cluster.routing.allocation.node_concurrent_outgoing_recoveries

I will set them back to the default values to see if it helps.
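For reference, a minimal sketch of how I plan to do that (assuming they were set as persistent settings and that Elasticsearch is reachable on localhost:9200 without security; adjust the URL and credentials for your environment). Setting a value to null removes it so the built-in default applies again; if any of them were set as transient settings, they would need to be nulled under "transient" as well:

$ # setting a persistent setting to null reverts it to its default
$ curl -s -X PUT 'localhost:9200/_cluster/settings' \
    -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null
  }
}'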

Yes please use the defaults for these settings.

Also, it looks like you have shards of quite different sizes, so after an upgrade like this we'd expect quite a bit of shard movement to balance the disk usage better. Rebalancing does not happen as fast as possible by design (it's a background process which we don't want to affect the regular operation of the cluster), so it could take quite some time to rebalance 50 TiB of data.
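If you want to watch the movements themselves, something like this should show the relocations currently in flight (assuming you can reach a node on localhost:9200; add authentication as needed):

$ # active_only=true limits the output to recoveries/relocations that are still running
$ curl -s 'localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,source_node,target_node,stage,bytes_percent'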

GET _internal/desired_balance is an internal API, but it should let you watch the balancing progress.
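For example, something like this trims the response down to just the stats and cluster_balance_stats summaries (filter_path is a standard REST option; the API itself is internal, so its exact shape may change between versions):

$ curl -s 'localhost:9200/_internal/desired_balance?filter_path=stats,cluster_balance_stats&pretty'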

Hello @DavidTurner,

With the change made in 8.6 that takes disk usage into account when rebalancing shards, should I expect an uneven number of shards on the nodes of the same tier when the shards have different sizes?

On 8.5.1 we had an almost even number of shards on each node, but the disk usage was quite different: some nodes had, for example, 1 TB of free space while others had only 180 GB, which was an issue for us. So after this upgrade, should I expect a more even use of disk space but an uneven number of shards on each node?

What should I look at in this API, given that there is no public documentation for it? I'm not sure how to interpret the response.

For example, this is part of the result of the API, without the nodes and shards information; in my case the data_content tier is on the data_warm nodes as well:

{
  "stats": {
    "computation_converged_index": 1010,
    "computation_active": false,
    "computation_submitted": 1011,
    "computation_executed": 1011,
    "computation_converged": 1004,
    "computation_iterations": 6905,
    "computed_shard_movements": 17927,
    "computation_time_in_millis": 203993,
    "reconciliation_time_in_millis": 4806
  },
  "cluster_balance_stats": {
    "tiers": {
      "data_warm": {
        "shard_count": {
          "total": 2306,
          "min": 169,
          "max": 212,
          "average": 192.16666666666666,
          "std_dev": 15.03237247483664
        },
        "forecast_write_load": {
          "total": 0,
          "min": 0,
          "max": 0,
          "average": 0,
          "std_dev": 0
        },
        "forecast_disk_usage": {
          "total": 46521804262408,
          "min": 3643675442782,
          "max": 3977820690712,
          "average": 3876817021867.3335,
          "std_dev": 98262085739.02898
        },
        "actual_disk_usage": {
          "total": 46521804262408,
          "min": 3643675442782,
          "max": 3977820690712,
          "average": 3876817021867.3335,
          "std_dev": 98262085739.02898
        }
      },
      "data_hot": {
        "shard_count": {
          "total": 1166,
          "min": 286,
          "max": 297,
          "average": 291.5,
          "std_dev": 5.024937810560445
        },
        "forecast_write_load": {
          "total": 0,
          "min": 0,
          "max": 0,
          "average": 0,
          "std_dev": 0
        },
        "forecast_disk_usage": {
          "total": 12949387024675,
          "min": 3117120702163,
          "max": 3368843902592,
          "average": 3237346756168.75,
          "std_dev": 110375675929.2402
        },
        "actual_disk_usage": {
          "total": 12949387024675,
          "min": 3117120702163,
          "max": 3368843902592,
          "average": 3237346756168.75,
          "std_dev": 110375675929.2402
        }
      },
      "data_content": {
        "shard_count": {
          "total": 2306,
          "min": 169,
          "max": 212,
          "average": 192.16666666666666,
          "std_dev": 15.03237247483664
        },
        "forecast_write_load": {
          "total": 0,
          "min": 0,
          "max": 0,
          "average": 0,
          "std_dev": 0
        },
        "forecast_disk_usage": {
          "total": 46521804262408,
          "min": 3643675442782,
          "max": 3977820690712,
          "average": 3876817021867.3335,
          "std_dev": 98262085739.02898
        },
        "actual_disk_usage": {
          "total": 46521804262408,
          "min": 3643675442782,
          "max": 3977820690712,
          "average": 3876817021867.3335,
          "std_dev": 98262085739.02898
        }
      }
    }
  }
}

I think that now that disk usage is taken into account, the number of shards on each node of a tier will not always be evenly distributed when the shard sizes differ, which in hindsight is obvious and expected.

For example, these are my 4 hot nodes:

They have been in this state for a while now and there is no shard movement between them, so it seems this is the balanced state of my hot tier now. That is nice, since the disk usage is pretty similar and the specs are the same for all of them, with 4 TB of disk on each.

So, answering my own question: an uneven number of shards on each node of a given tier is now expected, since disk usage is taken into consideration.

The routing_table section has an entry per shard, with a node_is_desired flag. The ones with false here are still to be moved.

Yes that's right. Elasticsearch still cares about shard count, but it now aims to strike a balance between even shard count and even disk usage and in doing so will not normally achieve perfection in either variable.
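For reference, the relative weights come from the cluster.routing.allocation.balance.* settings; a quick way to see their current or default values (assuming localhost:9200):

$ # flat_settings renders the keys as dotted strings so grep can pick them out
$ curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep 'cluster.routing.allocation.balance'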


Oh, thanks. I will write a quick script to fetch this response and print the shards still to be moved in a friendlier way.
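Something along these lines is what I have in mind (a rough sketch: it assumes jq is installed, Elasticsearch is on localhost:9200, and that routing_table is keyed by index name and then shard number with a current array per shard copy, as in the output above; field names may differ between versions):

#!/usr/bin/env bash
# List shard copies that are not yet on their desired node, one per line:
# index <TAB> shard <TAB> current node <TAB> primary?
curl -s 'localhost:9200/_internal/desired_balance' |
  jq -r '
    .routing_table
    | to_entries[]                  # one entry per index
    | .key as $index
    | .value | to_entries[]         # one entry per shard number
    | .key as $shard
    | .value.current[]?             # each copy of that shard
    | select(.node_is_desired == false)
    | [$index, $shard, .node, (.primary | tostring)]
    | @tsv'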

I'm using the default settings for cluster.routing.allocation.*, but my cluster is now constantly moving shards around; it seems it never reaches a balanced state on my warm nodes.

It seems that moving a shard to a node triggers yet another rebalance, so ever since I upgraded to 8.8.1 it has kept moving shards around.

Is there anything I can change to make it stop moving shards around? Is this normal now? It increases the network traffic between the nodes.

Did you manage to analyse the output from GET _internal/desired_balance? How many shards with node_is_desired: false do you have? And is this remaining constant over time?

I'm not monitoring this API, so I cannot say if the value is constant over time.

I just checked over a few intervals: it had 822 shards with false, then 816, then 821, and after a couple more minutes it increased to 831.

I could build a quick script to query this API, count the shards with node_is_desired set to false, and run it every 5 minutes, but I will wait a couple more days to see if the issue persists.
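If I do end up doing it, it would be something like this rough sketch (it assumes jq and localhost:9200; counting any object in the response whose node_is_desired field is false avoids depending on the exact nesting of the output):

#!/usr/bin/env bash
# Every 5 minutes, log how many shard copies are not yet on their desired node.
while true; do
  count=$(curl -s 'localhost:9200/_internal/desired_balance' |
    jq '[.. | objects | select(.node_is_desired? == false)] | length')
  echo "$(date -u +%FT%TZ) shards_not_on_desired_node=$count"
  sleep 300
done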

One thing I should add is that we have a hot/warm architecture and use daily indices with ILM configured, so every day shards are moved from the hot nodes to the warm nodes and old indices are deleted, which of course triggers a rebalance.
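For context, the policy looks roughly like this (a simplified, hypothetical sketch: the policy name and the ages are made up, and the warm phase relies on ILM's automatic migrate action to move shards onto the warm nodes):

$ curl -s -X PUT 'localhost:9200/_ilm/policy/daily-logs-policy' \
    -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "hot":    { "actions": {} },
      "warm":   { "min_age": "1d",  "actions": {} },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}'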

When shard balancing didn't take disk usage into consideration, every node would get an equal number of shards and the balancing process would finish pretty quickly, but now it seems the warm nodes spend all day rebalancing shards and the process doesn't finish before the next ILM run, which then forces another rebalance.

I will watch this for a couple more days, but is there any way to go back to the old behavior? Constantly moving shards is not desirable because of the network load.

Hmm it seems surprising to have 800+ of your ~2300 shards in the wrong place. Could you try DELETE /_internal/desired_balance and see if that gets things to settle down more quickly?

That request returns this:

Request failed to get to the server (status code: 200)

I don't recognise that message - it's not coming from Elasticsearch.

I'm running it in Kibana Dev Tools.

Our Kibana is behind a GCP load balancer; maybe the error comes from there. Let me try running it directly on an Elasticsearch node.

EDIT:

Running it from an Elasticsearch node seems to have worked, but the curl command for that endpoint produces no output. Is that right?
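What I ran was essentially this, assuming the default port and no security in between:

$ curl -s -X DELETE 'localhost:9200/_internal/desired_balance'
$ # prints nothing; the endpoint returns an empty response body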

the curl command for that endpoint produces no output. Is that right?

That's right. Does GET _internal/desired_balance look any better now?

Yeah, I just saved the response to a JSON file and got this:

$ cat desired-balance.json | grep "node_is_desired" | grep -v relocating  | grep false | wc -l
53

It showed 53 at that point, but on a second check it had increased to 57, and then to 60.

I will wait a while to see if it keeps increasing.

After 2 hours the number increased to 97.

I'm not sure this solved anything; after a couple more hours there are now 280 shards with node_is_desired set to false.

Hello @DavidTurner,

There was another issue: a couple of the nodes were hitting the disk watermarks and others were close to them, so any shard movement could probably push another node over a watermark, which would in turn trigger more shard movement.
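For anyone checking the same thing, something like this shows how full each node's disk is (assuming localhost:9200; the default watermarks are percentage-based, with the high watermark at 90% used, so disk.percent is the column to watch):

$ # one line per data node, fullest disks first
$ curl -s 'localhost:9200/_cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent&s=disk.percent:desc'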

I removed some old data to make sure no node in the warm tier would hit any watermark, and after some time there were no more shard movements:

$ cat desired-balance.json | grep "node_is_desired" | grep -v relocating  | grep false | wc -l
0

Hmm, that is a little puzzling. The new allocator should let you run nodes closer to full because of its disk balancing.

I'm not sure if I'm right, but after I removed some old data to free up more space on the nodes, the shards stopped moving around.

Before that, the number of shards with node_is_desired set to false kept increasing; the last time I checked it was around 600, and 2 of the 12 warm nodes had already hit the second watermark, the high watermark.

After I freed some space on the warm tier, things went back to normal.
