Undo decommissioning of a node

loren · January 15, 2018, 11:40pm

I am in the process of shrinking a 1.7 cluster as indices get moved over to a 6.1 cluster. I shrank it from 35 nodes to 8 using shard allocation filtering, and this worked just great. I then marked another 3 nodes for draining, but halfway through I realized we still needed them for a few more days. How do I undo this?

Based on this and this, I tried to un-decommission them by doing

    curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{ "transient" : { "cluster.routing.allocation.exclude._name" : "" } }'

but this didn't stop the shards from being relocated. Then I tried filtering against a nonsense name, but that didn't work either:

    curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{ "transient" : { "cluster.routing.allocation.exclude._name" : "halp" } }'

Then I tried doing a full restart of the cluster, bringing all nodes down, to reset this transient setting. It's still trying to drain those 3 nodes.
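
One thing worth ruling out at this point: transient cluster settings do not survive a full cluster restart, so if the drain resumes afterwards, the exclusion may also be stored in the persistent scope. The following is only a minimal sketch under that assumption. It assumes a node answering on localhost:9200 as in the commands above, and the helper name clear_exclude_body is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Elasticsearch): print a settings body that
# blanks one exclude filter in BOTH the transient and the persistent scope.
# $1 is the filter attribute, e.g. _name, _ip or _host (defaults to _name).
clear_exclude_body() {
  attr="${1:-_name}"
  key="cluster.routing.allocation.exclude.${attr}"
  printf '{"transient":{"%s":""},"persistent":{"%s":""}}\n' "$key" "$key"
}

# Usage (assumes a node listening on localhost:9200):
#   # 1. See which scope actually holds the filter.
#   curl -XGET 'localhost:9200/_cluster/settings?pretty'
#   # 2. Blank it in both scopes.
#   curl -XPUT 'localhost:9200/_cluster/settings?pretty' \
#        -H 'Content-Type: application/json' -d "$(clear_exclude_body _name)"
```

If the GET shows nothing under either scope, the filter is not coming from the cluster-level settings and the cause has to be looked for elsewhere.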

The only way I've found to stop it is to set some artificially high disk watermarks, but there has got to be a better way, even on ES 1.7.

How can I stop this madness?

loren · January 16, 2018, 8:18pm

On another much smaller 1.7 cluster, I was able to exclude a node, watch it drain, set exclude._name to "", and then see that shards immediately got relocated to that node. So that is apparently not the problem.

On the other two larger clusters (400 indices, 3000 shards, 12 nodes), I think there's just something else going on, possibly due to the forced awareness of AZs and/or the sheer size of the cluster in terms of shard count. Even with 12 data nodes and everything pretty balanced AFAICT, one cluster is still going nuts relocating/replicating shards. And it's not all in one direction either: sometimes a node gets a few hundred GBs of data only to have another few hundred GBs moved away immediately after. It's unclear to me how the rebalancing is determined, or how long it will take to quiesce.

At any rate, this seems to be an issue around rebalancing heuristics and not shard allocation filtering.
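
To get a rough sense of whether the cluster is quiescing, one option is to tally shard states over time. This is just a sketch, not a diagnosis of the rebalancing itself: it assumes the same localhost:9200 endpoint as above and uses the cat shards API with the h=state column selector; count_shard_states is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read one shard state per line on stdin (the output of
# `_cat/shards?h=state`) and print "STATE count" pairs, sorted by state name.
count_shard_states() {
  awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage (assumes a node listening on localhost:9200):
#   curl -s 'localhost:9200/_cat/shards?h=state' | count_shard_states
# Run it every few minutes; a shrinking RELOCATING count suggests the
# cluster is settling.
```

The relocating_shards field of the cluster health API gives the same headline number if only the total is needed.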

