Cluster.routing.allocation.exclude._name not working

Hi,

I am trying to decommission some nodes in an old (ES2.2) cluster. First step is moving the data off of the nodes. I'm using:

curl -XPUT ES_HOST:9200/_cluster/settings?explain -H 'Content-Type: application/json' -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._name" : "ES_NODENAME"
  }
}'

And I get the typical ack:

> {"acknowledged":true,"persistent":{},"transient":{"cluster":{"routing":{"allocation":{"exclude":{"_name":"ES_NODENAME"}}}}}}

However, shards do not start relocating as I'd expect; nothing happens. When I check the cluster settings, I can see the exclusion I issued:

> curl -XGET 'http://ES_HOST:9200/_cluster/settings?pretty'
> {
>   "persistent" : {
>     "threadpool" : {
>       "search" : {
>         "queue_size" : "10000"
>       }
>     }
>   },
>   "transient" : {
>     "cluster" : {
>       "routing" : {
>         "allocation" : {
>           "cluster_concurrent_rebalance" : "-1",
>           "exclude" : {
>             "_name" : "ES_NODENAME"
>           }
>         }
>       }
>     }
>   }
> }

At this point I'm at a loss. There's nothing in the logs regarding this. Anyone have any suggestions?

This all looks in order, except that I guess you've redacted the node name. The first thing I would do is to triple-check that this is correct (vs the node name reported by GET /). Get someone else to look at it if possible.
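
For example, using the same ES_HOST placeholder as above:

curl -XGET 'ES_HOST:9200/'
curl -XGET 'ES_HOST:9200/_cat/nodes?v&h=name,ip'

The exclusion matches against the node name (the name field from GET / and the name column from _cat/nodes), not the hostname or IP; those use _host and _ip instead.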

The next thing I'd suspect is that the shards are staying on this node because they can't be reallocated elsewhere for some other reason. Try creating a new index with a single shard and enough replicas to give one shard copy to every node in your cluster. If all of the new shards are allocated then I'm stumped, but if it prefers not to allocate a copy to the excluded node then the exclusion is working and there's some other reason for nothing happening.
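
A sketch of that test (the 23 replicas below are just an example; use one fewer replicas than you have nodes):

curl -XPUT 'ES_HOST:9200/test_index' -H 'Content-Type: application/json' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 23
  }
}'
curl -XGET 'ES_HOST:9200/_cat/shards/test_index?v'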

Thanks David. I did redact the node name (elasticsearch14a). I thought that might be the issue as well, but I made sure the node shows up in /_cat/nodes and checked GET / like you suggested, and the spelling is right.

I created a test_index index with one shard and 23 replicas (this is a 24 node cluster). I see that I have one unassigned shard, with the excluded node being the one that doesn't have a replica:

test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch03a
test_index     0  p STARTED            0    159b 10.X.X.X elasticsearch09a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch10b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch07a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch04b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch05a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch04a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch11b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch02a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch11a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch06a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch03b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch10a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch13a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch07b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch13b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch14a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch05b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch08a
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch09b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch08b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch02b
test_index     0  r STARTED            0    159b 10.X.X.X elasticsearch06b
test_index     0  r UNASSIGNED

So it appears that the exclusion is working, but perhaps something else is stopping data from moving off of the node. I'm somewhat green when it comes to ES administration... what are my next steps?

I think I have figured out the issue. I tried rerouting a single shard from the excluded node to another node, and got this in the response:

[NO(too many shards for this index [sessions-17w29] on node [1], limit: [1])]
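
For reference, the reroute attempt was roughly this shape (the shard number and target node below are just illustrative):

curl -XPOST 'ES_HOST:9200/_cluster/reroute?explain' -H 'Content-Type: application/json' -d '{
  "commands" : [
    {
      "move" : {
        "index" : "sessions-17w29",
        "shard" : 0,
        "from_node" : "elasticsearch14a",
        "to_node" : "elasticsearch02a"
      }
    }
  ]
}'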

Someone set these indexes to have 24 shards, and now that we're down to 24 nodes with total_shards_per_node at 1, every other node already holds a shard of the index, so the excluded node's shards have nowhere to go. I will need to set total_shards_per_node to something higher than 1. Sounds like I need to do a little research into what the I/O hit will be.
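
For anyone who hits the same thing, total_shards_per_node is a dynamic per-index setting, so raising it would look something like this (the 2 is just an example value):

curl -XPUT 'ES_HOST:9200/sessions-17w29/_settings' -H 'Content-Type: application/json' -d '{
  "index.routing.allocation.total_shards_per_node" : 2
}'

The same update can be applied to all of the affected indexes at once with a wildcard in the index name (e.g. sessions-*).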


Yes, that'd explain it. Great work.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.