Shard allocation - strange behaviour of index tier preference

Hi there,

I have a hot-warm cluster with 2 hot and 2 warm nodes (Elastic Cloud v8.6.1). I also have an index that I want to distribute across both the hot- and warm-tier nodes. To do that, I want to set the number of replica shards to 2 and ensure the shards are placed on 3 of the 4 nodes of the cluster. For the sake of argument, let's say I want to place 2 shards on the hot-tier nodes and one on one of the warm-tier nodes.

I tried setting the index setting index.routing.allocation.include._tier_preference to "data_hot,data_warm", but that causes one of the shards to go unallocated. When I set it to "data_warm,data_hot", it works fine. I would like to be able to explain this behaviour, if possible, and understand the logic behind this setting. The documentation wasn't that helpful, but it could be me...

Here are the steps to reproduce it:

# Unallocated shard
PUT test1

PUT test1/_settings
{
  "index.routing.allocation.include._tier_preference": "data_hot,data_warm",
  "number_of_replicas": 2
}

GET _cat/indices/test1?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   test1 l7bDEILVR02OTAI43P_pvA   1   2          0            0       450b           225b


GET _cat/shards/test1?v
index shard prirep state      docs store ip         node
test1 0     p      STARTED       0  225b 10.1.9.157 instance-0000000000
test1 0     r      STARTED       0    0b 10.1.13.47 instance-0000000001
test1 0     r      UNASSIGNED     


# All three shards get allocated
PUT test2

PUT test2/_settings
{
  "index.routing.allocation.include._tier_preference": "data_warm,data_hot",
  "number_of_replicas": 2
}

GET _cat/indices/test2?v
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test2 3d5935zXQQGL225PIBrfCA   1   2          0            0       450b           225b

GET _cat/shards/test2?v
index shard prirep state   docs store ip         node
test2 0     r      STARTED    0  225b 10.1.13.37 instance-0000000005
test2 0     r      STARTED    0  225b 10.1.16.5  instance-0000000004
test2 0     p      STARTED    0  225b 10.1.9.157 instance-0000000000

Thanks in advance

Hi Michael,
Yes, it is kind of tricky, but once you figure it out it works well.
If you want an accurate test, you need to set _tier_preference at index creation.

In your tests, both indices are created with the default tier preference, which is data_content, whose nodes are generally the same as data_hot.
Then shards are allocated based on tier availability. In your case, both tiers are available, so the preference can be satisfied.
So for index test1 you don't change the preferred tier (data_content = data_hot),
and for index test2 you do change the preferred tier, allowing shards to be created on data_warm when you change the setting.
If you want to verify this, relaunch the test with a GET _cat/shards/test2?v after creating the index with:

PUT test2
{
  "settings": {
    "number_of_replicas": 2,
    "index.routing.allocation.include._tier_preference": "data_warm,data_hot"
  }
}

I assume you would have some unallocated shards.
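
You can also check which tier preference was actually applied, since Elasticsearch writes the effective _tier_preference into the index settings at creation time. A plain settings GET should show it (the filter_path parameter here just trims the response to the routing settings):

GET test2/_settings?filter_path=*.settings.index.routing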
To sum it up, I think node roles do not permit what you're trying to achieve. You could instead have a look at Cluster-level shard allocation and routing settings | Elasticsearch Guide [8.6] | Elastic
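
For example, shard allocation filtering can target a custom node attribute instead of a tier. This is only an untested sketch: it assumes your nodes expose a data: hot / data: warm node attribute (the allocation explain output later in this thread suggests Elastic Cloud instances do), and the tier preference has to be removed at the same time, otherwise both deciders still apply:

# Hypothetical: allow the index on any node whose "data" attribute is hot or warm
PUT test1/_settings
{
  "index.routing.allocation.include._tier_preference": null,
  "index.routing.allocation.include.data": "hot,warm"
}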

Thanks. I tried applying the settings at index creation time, but now in both cases there is one unallocated shard, with the other two shards residing on the hot-tier nodes:

PUT test1
{
  "settings": { 
    "index.routing.allocation.include._tier_preference": "data_hot,data_warm",
    "number_of_replicas": 2
  }
}
GET _cat/shards/test1?v
index shard prirep state      docs store ip         node
test1 0     p      STARTED       0  225b 10.1.9.157 instance-0000000000
test1 0     r      STARTED       0    0b 10.1.13.47 instance-0000000001
test1 0     r      UNASSIGNED   

PUT test2
{
  "settings": { 
    "index.routing.allocation.include._tier_preference": "data_warm,data_hot",
    "number_of_replicas": 2
  }
}
GET _cat/shards/test2?v
index shard prirep state      docs store ip         node
test2 0     p      STARTED       0  225b 10.1.13.37 instance-0000000005
test2 0     r      STARTED       0  225b 10.1.16.5  instance-0000000004
test2 0     r      UNASSIGNED     

I made sure that no custom transient or persistent cluster-level settings were affecting it:

PUT _cluster/settings
{
  "transient": {
      "cluster.routing.allocation.awareness.attributes": null
  } 
}

What I am after is understanding the behaviour of these settings in conjunction with each other. So far, there is only one way to make things work the way I need them to (distributing index shards between hot and warm nodes), but I don't understand the logic of why it does or doesn't work.

Really appreciate your help with it BTW @vincenbr

Hi Michael,
If I understand correctly, you cannot achieve what you want with ILM / data-tier-enabled indices.
The reason is that when you set an index's _tier_preference to data_warm,data_hot and the first tier (here data_warm) exists in your cluster, your index is "fated" to live on that tier, even if there are not enough nodes in that tier to accommodate the required number of shard copies. Here that means 1 primary + 2 replicas = 3 copies but only 2 warm nodes, so one replica must stay unassigned.
The whole point of data tiers is to isolate indices on nodes (and thus hardware resources) that are suited to their requirements.

  • hot -> best performance, best disk speed
  • warm -> lower performance (RAM to index data ratio), slower disks
  • cold -> you get the idea

If you do need an index to spread across this segmentation, it is probably better to remove its _tier_preference. But such an index could not be managed by ILM.
Hope this helps!

My problem is exactly the opposite. I want the shard allocation to ignore the hot and warm node roles. What is the best way to achieve it in a hot-warm cluster?

You can achieve that at the index level.
If you remove (set to null) index.routing.allocation.include._tier_preference in a specific index's settings, the allocation process will ignore tiers and that index will be distributed independently of data node roles.

It doesn't work, unfortunately:

# PUT test2 200 
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "test2"
}
# GET _cat/shards/test2?v 200 
index shard prirep state      docs store ip         node
test2 0     p      STARTED       0  225b 10.1.9.157 instance-0000000000
test2 0     r      STARTED       0  225b 10.1.13.47 instance-0000000001
test2 0     r      UNASSIGNED  

What I meant is that you can disable the tier preference for the index you want to span across different tiers.

To do this, you must run the following against an existing index:

PUT test2/_settings
{
  "index.routing.allocation.include._tier_preference": null
}

Then check the shard allocation. What does it say?

Can't get it to work. I created a brand-new cluster with 2 hot and 2 warm nodes.

# Creating the index
PUT test1
{
  "settings": {
    "index.routing.allocation.include._tier_preference": null,
    "number_of_replicas": 2
  }
}

# GET _cat/shards/test1?v 200 OK
index shard prirep state      docs store ip            node
test1 0     p      STARTED       0  225b 10.47.192.173 instance-0000000000
test1 0     r      STARTED       0  225b 10.47.192.99  instance-0000000001
test1 0     r      UNASSIGNED                          

# GET _cluster/allocation/explain 200 OK
{
  "index": "test1",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "INDEX_CREATED",
    "at": "2023-03-07T22:56:28.660Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "Elasticsearch isn't allowed to allocate this shard to any of the nodes in the cluster. Choose a node to which you expect this shard to be allocated, find this node in the node-by-node explanation, and address the reasons which prevent Elasticsearch from allocating this shard there.",
  "node_allocation_decisions": [
    {
      "node_id": "79fDvTXCTr24DjmsUMsYOA",
      "node_name": "instance-0000000003",
      "transport_address": "10.47.192.117:19037",
      "node_attributes": {
        "region": "unknown-region",
        "instance_configuration": "gcp.data.highstorage.1",
        "server_name": "instance-0000000003.4f379901a9b14417a3e141d745799f8f",
        "data": "warm",
        "xpack.installed": "true",
        "logical_availability_zone": "zone-1",
        "availability_zone": "us-west2-b"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_content] and node does not meet the required [data_content] tier"
        }
      ]
    },
    {
      "node_id": "NZ2kKMF3R-WkhNF56SCJgA",
      "node_name": "instance-0000000002",
      "transport_address": "10.47.192.127:19677",
      "node_attributes": {
        "region": "unknown-region",
        "instance_configuration": "gcp.data.highstorage.1",
        "server_name": "instance-0000000002.4f379901a9b14417a3e141d745799f8f",
        "data": "warm",
        "xpack.installed": "true",
        "logical_availability_zone": "zone-0",
        "availability_zone": "us-west2-c"
      },
      "node_decision": "no",
      "weight_ranking": 2,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_content] and node does not meet the required [data_content] tier"
        }
      ]
    },
    {
      "node_id": "ZktlgqMhQF6opcfiq9mP6g",
      "node_name": "instance-0000000001",
      "transport_address": "10.47.192.99:19787",
      "node_attributes": {
        "region": "unknown-region",
        "instance_configuration": "gcp.data.highio.1",
        "server_name": "instance-0000000001.4f379901a9b14417a3e141d745799f8f",
        "data": "hot",
        "xpack.installed": "true",
        "logical_availability_zone": "zone-1",
        "availability_zone": "us-west2-b"
      },
      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[test1][0], node[ZktlgqMhQF6opcfiq9mP6g], [R], s[STARTED], a[id=neRr7bqfTRqodGpKFEjcGw], failed_attempts[0]]"
        }
      ]
    },
    {
      "node_id": "q_tYt6RFSHO5efwV-IAjsQ",
      "node_name": "instance-0000000000",
      "transport_address": "10.47.192.173:19307",
      "node_attributes": {
        "region": "unknown-region",
        "instance_configuration": "gcp.data.highio.1",
        "server_name": "instance-0000000000.4f379901a9b14417a3e141d745799f8f",
        "data": "hot",
        "xpack.installed": "true",
        "logical_availability_zone": "zone-0",
        "availability_zone": "us-west2-c"
      },
      "node_decision": "no",
      "weight_ranking": 4,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[test1][0], node[q_tYt6RFSHO5efwV-IAjsQ], [P], s[STARTED], a[id=HQbhwF4zTyalKa3rCQW9cA], failed_attempts[0]]"
        }
      ]
    }
  ]
}
# GET _cluster/allocation/explain 200 OK
{
  "index": "test1",
  "shard": 0,
  "primary": true,
  "current_state": "started",
  "current_node": {
    "id": "q_tYt6RFSHO5efwV-IAjsQ",
    "name": "instance-0000000000",
    "transport_address": "10.47.192.173:19307",
    "attributes": {
      "server_name": "instance-0000000000.4f379901a9b14417a3e141d745799f8f",
      "instance_configuration": "gcp.data.highio.1",
      "region": "unknown-region",
      "availability_zone": "us-west2-c",
      "logical_availability_zone": "zone-0",
      "xpack.installed": "true",
      "data": "hot"
    },
    "weight_ranking": 3
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "no",
  "can_rebalance_cluster_decisions": [
    {
      "decider": "rebalance_only_when_active",
      "decision": "NO",
      "explanation": "rebalancing is not allowed until all replicas in the cluster are active"
    },
    {
      "decider": "cluster_rebalance",
      "decision": "NO",
      "explanation": "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "Elasticsearch is not allowed to allocate or rebalance this shard to another node. If you expect this shard to be rebalanced to another node, find this node in the node-by-node explanation and address the reasons which prevent Elasticsearch from rebalancing this shard there.",
  "node_allocation_decisions": [
    {
      "node_id": "79fDvTXCTr24DjmsUMsYOA",
      "node_name": "instance-0000000003",
      "transport_address": "10.47.192.117:19037",
      "node_attributes": {
        "server_name": "instance-0000000003.4f379901a9b14417a3e141d745799f8f",
        "instance_configuration": "gcp.data.highstorage.1",
        "region": "unknown-region",
        "availability_zone": "us-west2-b",
        "logical_availability_zone": "zone-1",
        "xpack.installed": "true",
        "data": "warm"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_content] and node does not meet the required [data_content] tier"
        }
      ]
    },
    {
      "node_id": "NZ2kKMF3R-WkhNF56SCJgA",
      "node_name": "instance-0000000002",
      "transport_address": "10.47.192.127:19677",
      "node_attributes": {
        "server_name": "instance-0000000002.4f379901a9b14417a3e141d745799f8f",
        "instance_configuration": "gcp.data.highstorage.1",
        "region": "unknown-region",
        "availability_zone": "us-west2-c",
        "logical_availability_zone": "zone-0",
        "xpack.installed": "true",
        "data": "warm"
      },
      "node_decision": "no",
      "weight_ranking": 2,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_content] and node does not meet the required [data_content] tier"
        }
      ]
    },
    {
      "node_id": "ZktlgqMhQF6opcfiq9mP6g",
      "node_name": "instance-0000000001",
      "transport_address": "10.47.192.99:19787",
      "node_attributes": {
        "server_name": "instance-0000000001.4f379901a9b14417a3e141d745799f8f",
        "instance_configuration": "gcp.data.highio.1",
        "region": "unknown-region",
        "availability_zone": "us-west2-b",
        "logical_availability_zone": "zone-1",
        "xpack.installed": "true",
        "data": "hot"
      },
      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[test1][0], node[ZktlgqMhQF6opcfiq9mP6g], [R], s[STARTED], a[id=neRr7bqfTRqodGpKFEjcGw], failed_attempts[0]]"
        }
      ]
    }
  ]
}

The following request does not do what you want.

PUT test1
{
  "settings": {
    "index.routing.allocation.include._tier_preference": null,
    "number_of_replicas": 2
  }
}

When you create the index this way, the setting with the null value is ignored and the index gets the default value, which is data_content. This is confirmed by the result of your allocation explain:

index has a preference for tiers [data_content] and node does not meet the required [data_content] tier

If you want to set _tier_preference to null, you need to first create the index and then use the _settings endpoint.

Try this:

PUT test1
{
  "settings": {
    "number_of_replicas": 2
  }
}

PUT test1/_settings
{
  "index.routing.allocation.include._tier_preference": null
}

That does the trick! Thank you. It would be good to understand the logic behind it though. For example, why does this setting not work at index creation time, only afterwards? Also, why does the order of data_hot and data_warm matter?

PUT test1
{
  "settings": {
    "number_of_replicas": 2
  }
}

PUT test1/_settings
{
      "index.routing.allocation.include._tier_preference": null
}

# GET _cat/indices/test1?v 200 OK
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test1 WNu_EyyQS5qANnldJC23_w   1   2          0            0       675b           225b

# GET _cat/shards/test1?v 200 OK
index shard prirep state   docs store ip            node
test1 0     p      STARTED    0  225b 10.47.192.117 instance-0000000003
test1 0     r      STARTED    0  225b 10.47.192.99  instance-0000000001
test1 0     r      STARTED    0  225b 10.47.192.127 instance-0000000002

When you create an index using PUT index and pass the settings in the body of the request, a setting whose value is null is ignored.

When you use PUT index/_settings you are explicitly changing the setting for the index, but you can only do that after the index has been created.

Normally you would use a template with the settings you want, and this template would be applied when the index is created.
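
For instance, a minimal composable index template might look like this (the template name and index pattern are made up for illustration; given the behaviour above, a null _tier_preference would presumably still be ignored at creation time, so the follow-up PUT test1/_settings call would still be needed):

PUT _index_template/spread_across_tiers
{
  "index_patterns": ["test*"],
  "template": {
    "settings": {
      "number_of_replicas": 2
    }
  }
}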

Not sure why this happens; it is probably related to the role of each node on Elastic Cloud. But you can use the cluster allocation explain API to understand why the replicas are or are not allocated.

You can use the following request:

GET _cluster/allocation/explain?include_yes_decisions
{
  "index": "indexName",
  "shard": 0,
  "primary": false
}

The include_yes_decisions parameter will also tell you why a shard was allocated, so you can see the difference.
