BUG: Elasticsearch ignoring node.roles after upgrade from 7.6.1

Hello everyone,

When upgrading from 7.6.1 to 7.10.2, I replaced the old node.master and node.data settings with the new node.roles setting. I changed 2 of the 5 data nodes to data_cold only. But if a node has already been used as a data node before, it does not respect this role: it starts and works as expected, but it still holds warm/hot shards, and you can still move warm/hot shards to it.
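For reference, the change on the two cold-only data nodes looked roughly like this (just a sketch; the real node names and the rest of the config are left out):

# Old 7.6.1 role settings on a data node
# node.master: false
# node.data: true

# Replaced in 7.10.2 with
node.roles: data_cold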

When I do a fresh install of 7.10.2 with the same config and cluster settings, it does use the data_cold role as expected.
I have typed out a simple log of the steps to recreate this problem if anyone wants to reproduce it.

Note: upgrading to the newest version of Elasticsearch also did not fix it.

Can you share your config, please?

cluster.name: test-cluster
node.name: elkm02
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 192.168.1.5
discovery.seed_hosts: ["192.168.1.1", "192.168.1.2", "192.168.1.3", "192.168.1.5"]
cluster.initial_master_nodes: ["192.168.1.1", "192.168.1.5"]
node.roles: master, ingest

Of course. This is the config of the cluster I managed to recreate the problem in.
This is the second test master; the only changes between the nodes are the name and the roles.

Any luck finding something? I can post the steps I took to recreate the problem if that helps.

Could you share the output of GET _cat/nodes from the cluster exhibiting the problem?
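Something like this, with the v flag so the column headers are included as well:

GET _cat/nodes?v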

If after completing the upgrade you do a further rolling restart (i.e. restart all nodes, one-by-one) does the problem persist?

ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.1.3           56          95   4    0.33    0.14     0.13 hsw       -      elkd02
192.168.1.4            8          95   5    0.16    0.03     0.05 c         -      elkd03
192.168.1.5           34          95   2    0.08    0.02     0.03 im        *      elkm02
192.168.1.2           15          95   1    0.62    0.86     0.87 hsw       -      elkd01
192.168.1.1           46          94   2    0.29    0.25     0.22 im        -      elkm01

Yes, even if I take the entire cluster down and bring it back up, the problem still persists.

OK, would you use the cluster allocation explain API to explain the allocation of one of the shards you think is allocated in the wrong place?

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}

That is the weird thing. The server thinks it's all fine, even though there are hot/warm indices on the cold node.

You need to tell the API which shard to explain, otherwise it just picks a random unassigned one and fails if all shards are assigned.
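For example, something like this, substituting one of your own index names and shard numbers (the index name here is only a placeholder):

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}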

Sorry, my bad.
Here it is:

{
  "index" : "my-index-000006",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "cFGH4_FKRoKPlEg8rPD6Mg",
    "name" : "elkd03",
    "transport_address" : "192.168.1.4:9300",
    "attributes" : {
      "xpack.installed" : "true",
      "transform.node" : "false"
    },
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "yes",
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance",
  "node_allocation_decisions" : [
    {
      "node_id" : "PEb7VU_1RKa8q3HN8J7LCA",
      "node_name" : "elkd02",
      "transport_address" : "192.168.1.3:9300",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "worse_balance",
      "weight_ranking" : 1
    },
    {
      "node_id" : "iNdDsFRiSquDN_V8iXwSPA",
      "node_name" : "elkd01",
      "transport_address" : "192.168.1.2:9300",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false"
      },
      "node_decision" : "worse_balance",
      "weight_ranking" : 1
    }
  ]
}

This is a newly created index with one shard, and that shard is located on the cold node.

OK, this shard can be allocated to all three data nodes. What does GET /my-index-000006/_settings return? What steps have you taken to exclude it from the cold node?

{
  "my-index-000006" : {
    "settings" : {
      "index" : {
        "creation_date" : "1622533255026",
        "number_of_shards" : "1",
        "number_of_replicas" : "0",
        "uuid" : "aTg9qJNSQxOEJwl-IVU1wA",
        "version" : {
          "created" : "7060199",
          "upgraded" : "7100299"
        },
        "provided_name" : "my-index-000006"
      }
    }
  }
}

Currently, nothing outside of the roles on this test environment. If I recreate the same setup with a fresh install, the shard cannot be moved to that cold node, since the node does not meet the requirement:

[NO(index has a preference for tiers [data_content] and node does not meet the required [data_content] tier)]

Could it be that Elasticsearch has an internal mechanism that prevents the cold/hot/warm roles from taking effect until there are enough managed indices?
I have been testing that theory and it seems to hold, but then why do the roles work immediately with a fresh install?

This index has no settings restricting its allocation to any particular tier, so it can be allocated anywhere. If you check GET /$INDEX/_settings on your fresh install, you will see that newer indices do have allocation settings applied. If you want to restrict the older indices to particular tiers, you'll need to apply those settings yourself.
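For example, something along these lines would pin the older index to the hot and warm tiers (the tier list here is only illustrative, adjust it to whatever fits your setup):

PUT /my-index-000006/_settings
{
  "index.routing.allocation.include._tier_preference": "data_hot,data_warm"
}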

Weird. Then I still can't explain the behavior on the production server: it ignored the roles completely until I added a new cold policy to move a bunch of older indices to cold. But we already had a working hot/warm/cold ILM policy in place at that time.
