Can both allocate this shard and improve the cluster balance

Hi everyone,

I've recently been setting up a new ELK stack, and am naturally trying to use all the cool tools and features possible.
In times past we've used regular indices and Curator to move their shards around based on custom node attributes.
For this stack I'm using data streams and setting up ILM to move things about based on the built-in hot/warm/cold node roles. It's not quite doing what we expect.

The stack has a bunch of master and coordinating nodes, 3 physical hot nodes, 3 physical warm nodes, and 2 physical cold nodes.
The physical nodes are very large. Even the indexing rate the hot nodes are handling barely stresses the CPU. The warm and cold servers are basically idle at the moment, and heap usage is all fine.

_ilm/explain shows that indices do move into the warm phase, which can be seen by ILM setting index.routing.allocation.include._tier_preference to "data_warm,data_hot" on the indices behind the data stream.
However, the problem we have is that not all of the shards move. Indices have between 1 and 4 primary shards, and all have 1 replica.
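(For reference, you can see the setting ILM applied with something like the request below; the response shown is a sketch, abbreviated to the relevant part:)

```
GET .ds-logs-nginx-xwing-2021.08.08-000022/_settings?filter_path=*.settings.index.routing

{
  ".ds-logs-nginx-xwing-2021.08.08-000022" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_warm,data_hot"
            }
          }
        }
      }
    }
  }
}
```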

_cluster/allocation/explain produces interesting output

{
  "index" : ".ds-logs-nginx-xwing-2021.08.08-000022",
...
  "current_node" : {
    "id" : "81c4tUd9RJmWN8LcuINxHg",
    "name" : "elk3-warm1",
    "transport_address" : "172.16.17.34:9300",
    "attributes" : {
      "temperature" : "warm",
      "xpack.installed" : "true",
      "transform.node" : "false"
    },
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "yes",
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance",
  "node_allocation_decisions" : [
    {
      "node_id" : "I4eupHHOQKe8XM1FVuVuxg",
      "node_name" : "elk3-hot3",
      "transport_address" : "172.16.17.39:9300",
      "node_attributes" : {
        "temperature" : "hot",
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[.ds-logs-nginx-xwing-2021.08.08-000022][1], node[I4eupHHOQKe8XM1FVuVuxg], [R], s[STARTED], a[id=RtrmtpHsS1a2Ey6UWARSog]]"
        }
      ]
    },
    {
      "node_id" : "-YBn47O-Q9uV2OWwbKwUhQ",
      "node_name" : "elk3-hot1",
      "transport_address" : "172.16.17.37:9300",
      "node_attributes" : {
        "temperature" : "hot",
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "worse_balance",
      "weight_ranking" : 1
    },
    {
      "node_id" : "BbQrlXIiT5-_2Yzr8iD4Hw",
      "node_name" : "elk3-warm3",
      "transport_address" : "172.16.17.36:9300",
      "node_attributes" : {
        "temperature" : "warm",
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "worse_balance",
      "weight_ranking" : 1
    },
    ... warm2 = worse_balance, weight_ranking=1 ...
    ... hot2 = worse_balance, weight_ranking=1 ...
    ... cold1 = worse_balance, weight_ranking=2 ...
    ... cold2 = worse_balance, weight_ranking=3 ...
  ]
}

The above is a mildly abbreviated example of _cluster/allocation/explain output for shard 1 of an index with the following shards:

.ds-logs-nginx-xwing-2021.08.08-000022 0 p STARTED 144804836   50gb 172.16.17.36 elk3-warm3
.ds-logs-nginx-xwing-2021.08.08-000022 0 r STARTED 144804836   50gb 172.16.17.34 elk3-warm1
.ds-logs-nginx-xwing-2021.08.08-000022 1 p STARTED 144821064 49.9gb 172.16.17.34 elk3-warm1
.ds-logs-nginx-xwing-2021.08.08-000022 1 r STARTED 144821064 49.9gb 172.16.17.39 elk3-hot3
.ds-logs-nginx-xwing-2021.08.08-000022 2 p STARTED 144809602 49.9gb 172.16.17.38 elk3-hot2
.ds-logs-nginx-xwing-2021.08.08-000022 2 r STARTED 144809602 49.9gb 172.16.17.35 elk3-warm2
.ds-logs-nginx-xwing-2021.08.08-000022 3 p STARTED 144818360   50gb 172.16.17.36 elk3-warm3
.ds-logs-nginx-xwing-2021.08.08-000022 3 r STARTED 144818360   50gb 172.16.17.37 elk3-hot1

The "worse_balance" decision appears to come from the fact that the warm servers have the same number of shards as the hot servers.
The cold servers have ~6 times as many shards as hot and warm, but they are tiny shards from indices full of junk we loaded at the very beginning, and they will be deleted eventually. They were moved there by Curator and index.routing.allocation.require.temperature = "cold".
The warm servers have more disk space than hot, and cold servers more than warm, so they should have more shards than the tier above.

My question to the collective is: what do I need to change so that the warm servers (and eventually the cold servers) don't produce a worse balance, i.e. so that _tier_preference itself carries more "weight"?
Maybe ILM can require something, instead of just including? index.routing.allocation.require._tier_preference doesn't exist.

"Shard balancing heuristics settings" don't have any effect, because on a shards-per-node basis the cluster is balanced.
"Disk-based shard allocation settings" do have an effect. However, I'm specifically trying to keep different nodes in the cluster at different disk usage levels, using time-based ILM to move data around. The % free we're trying to keep on the hot servers is much higher than the % free we're willing to let the warm servers drop to.
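(To be concrete, the disk-based settings I mean are the watermarks; the values below are the documented defaults, not our actual configuration:)

```yaml
cluster.routing.allocation.disk.watermark.low: "85%"
cluster.routing.allocation.disk.watermark.high: "90%"
cluster.routing.allocation.disk.watermark.flood_stage: "95%"
```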

Thanks
Mike

Hi Mike,

index.routing.allocation.include._tier_preference isn't a weight-based thing, it's binary: either a node satisfies the tier preference or it doesn't. In this case it looks like all of your nodes satisfy it, which likely means you haven't given them the appropriate roles (data_warm, data_cold etc).
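A quick way to double-check which roles each node actually has (node names will obviously differ in your output):

```
GET _cat/nodes?v&h=name,node.role
```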

Hey David,

The node roles are set correctly.
Hot nodes are data, data_content, and data_hot.
Warm are data, data_content, and data_warm.
Cold are data, data_content, and data_cold.

I did try changing index.routing.allocation.include._tier_preference to just data_warm on another index, but the allocator made the same worse_balance node decision.

Ah, OK. It sounds like the problem is that you're still using the legacy data role too. That role means "all tiers".
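In other words, in elasticsearch.yml each node should only list the roles for its tier, e.g. for a warm node (a sketch; add master, ingest, etc. as appropriate for your setup):

```yaml
# warm node: note there is no legacy "data" role here,
# since "data" makes the node a member of every tier
node.roles: [ data_warm, data_content ]
```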

David, thank you.
That was exactly it. Removing the data node role caused Elasticsearch to move every shard to the correct place.
The documentation on the data role is considerably less clear than the code you posted.

That's a good point. Would you open a bug report about those docs on GitHub and link back to this thread?