Replacing a cluster node

Hi, we have a 3-node cluster with 3 master-eligible nodes. Today I needed to replace one node with another on a new server. I simply shut down ES on the old server and started ES on the new server. It did join the cluster according to _cat/nodes, but no data/shards started being replicated to it. Cluster health remained yellow until the 2 remaining nodes had recovered all the missing data between themselves. Now I have 2 nodes each holding the full dataset (according to df -h), and the new node doesn't have any index data. What's going on, and how do I let it join the cluster normally and copy some of the data to itself? Thanks.

ES version 7.17.11

It's best to use the cluster allocation explain API to explain the allocation of shards. If you need help understanding its output, please feel free to share it here.
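If you're not sure which shard to ask about, you can also call it with no request body and it will pick an arbitrary unassigned shard to explain (if there are no unassigned shards it returns an error instead). A minimal example, assuming a node reachable on localhost:9200:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"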


The cluster allocation explain API will certainly tell you why shards haven't moved to your new node. I run into this sometimes, and allocation explain really helps and gives you a great starting point as to where to go next.

It would be interesting to see your output.

You may also want to look at running the _cluster/reroute command.

Cluster reroute API | Elasticsearch Guide [8.11] | Elastic

The cluster will attempt to allocate a shard a maximum of index.allocation.max_retries times in a row (defaults to 5), before giving up and leaving the shard unallocated.

I believe the reroute command restarts the process.
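For reference, if shards do end up unassigned after hitting that retry limit, something like this (assuming a node reachable on localhost:9200) asks the cluster to retry allocating them:

curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"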


I don't think this is great advice. If a call to this API is needed, the allocation explain API will tell you about it. If it's not needed then it's best not to call it.

Sorry, I meant to add that I agree with the rest of the message from @intrepid1 🙂 Allocation explain really is a useful API in this context. It's unlikely to relate to index.allocation.max_retries in this situation, since it sounds like there are no unassigned shards, but this limit is still useful to know about.

Thanks for the tips! Here's the explanation output:

The first server is the one with no data; the second and third each have the full dataset.

[rihad@eldey ~]$ curl -X GET "amiata.local:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "amiata.example.com"
}
'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find a replica shard assigned to node [amiata.example.com]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find a replica shard assigned to node [amiata.example.com]"
  },
  "status" : 400
}
[rihad@eldey ~]$ curl -X GET "eldey.local:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "eldey.example.com"
}
'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find a replica shard assigned to node [eldey.example.com]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find a replica shard assigned to node [eldey.example.com]"
  },
  "status" : 400
}
[rihad@eldey ~]$ curl -X GET "pico.local:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "my-index",
  "shard": 0,
  "primary": false,
  "current_node": "pico.example.com"
}
'
{
  "index" : "my-index",
  "shard" : 0,
  "primary" : false,
  "current_state" : "started",
  "current_node" : {
    "id" : "qj963ExLRI-bceFokseNTQ",
    "name" : "pico.example.com",
    "transport_address" : "172.16.1.11:9300",
    "attributes" : {
      "xpack.installed" : "true",
      "transform.node" : "true"
    },
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "yes",
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "cannot rebalance as no target node exists that can both allocate this shard and improve the cluster balance",
  "node_allocation_decisions" : [
    {
      "node_id" : "jxM3gIeDQAu8ex9ghdjshg",
      "node_name" : "eldey.example.com",
      "transport_address" : "172.16.1.8:9300",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[my-index][0], node[jxM3gIeDQAu8ex9ghdjshg], [P], s[STARTED], a[id=gh3V_k79Rg6KW2OP02ijTw]]"
        }
      ]
    },
    {
      "node_id" : "pXeE9Ij_T-64jFUDMhJ34w",
      "node_name" : "amiata.example.com",
      "transport_address" : "172.16.1.6:9300",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "worse_balance",
      "weight_ranking" : 1
    }
  ]
}

Hmm that says that moving the shard to this node would make the cluster more imbalanced. What does GET _cat/allocation return?

[rihad@eldey ~]$ curl pico.local:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
     0           0b     284kb      1.3tb      1.3tb            0 172.16.1.6  172.16.1.6  amiata.example.com
     1       23.6gb    23.4gb      1.6tb      1.6tb            1 172.16.1.8  172.16.1.8  eldey.example.com
     1       23.6gb    23.3gb      1.6tb      1.6tb            1 172.16.1.11 172.16.1.11 pico.example.com

Ok, you only have 2 shards, so with three nodes there's always going to be one node with zero shards.
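If you do want a copy of the data on every node, one option would be to raise the replica count for that index to 2 (one primary plus two replicas, one copy per node). A sketch, using the index name from your output and assuming the node is reachable on localhost:9200:

curl -X PUT "localhost:9200/my-index/_settings?pretty" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 2
  }
}
'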


Oh, indeed, another cluster that hasn't had a node replacement also exhibits this behavior with 2 shards:

shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
     0           0b     228kb      1.6tb      1.6tb            0 172.16.1.21 172.16.1.21 uranus.local
     1        4.3gb     4.2gb      1.6tb      1.6tb            0 172.16.1.25 172.16.1.25 sun.local
     1        4.4gb     4.2gb    766.2gb    770.5gb            0 172.16.1.18 172.16.1.18 moon.local

Yet another with 10 shards has this:

shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
     3        4.4gb     4.1gb      3.2tb      3.2tb            0 172.16.1.23 172.16.1.23 camille.local
     4        5.9gb     5.5gb      690gb    695.6gb            0 172.16.1.24 172.16.1.24 carol.local
     3        4.5gb     4.2gb      3.2tb      3.2tb            0 172.16.1.22 172.16.1.22 bahamas.local

So this is normal, so to speak. Shard placement is balanced automatically across the nodes. I believe tweaking the shard count manually has more to do with performance than with resilience? A way to improve resilience would be to add more nodes, like 5 nodes would allow up to 2 nodes to be lost.
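In case it helps anyone else: the shard and replica counts are per-index settings. The replica count can be changed at any time (as above), while the primary shard count is fixed when the index is created. A hypothetical index creation, assuming a node on localhost:9200 and an index name of new-index:

curl -X PUT "localhost:9200/new-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}
'

With 1 replica the data survives losing one node; replicas (plus enough nodes to hold them) are what add data redundancy.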
