We have a scenario where we need to shrink a cluster of 60 data nodes (plus 1 master) down to just 30 data nodes. The cluster currently holds about 1.8 PB of data, and each data node is a powerful bare-metal server connected via a 10GbE switch.
Normally, when decommissioning a node, we just use cluster-level shard allocation filtering (`"cluster.routing.allocation.exclude._name": "<node name>"`), but we've never done this at anywhere near this scale, so we're wondering what the best approach is.
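For reference, this is roughly the call we issue today for a single node (the node name is a placeholder):

```json
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "es-data-42"
  }
}
```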
What could potentially happen if all 30 nodes are excluded at once? Could the migrating data overwhelm the 10GbE switch and destabilize the cluster? If so, is there a dynamic mechanism to throttle the shard relocation somehow?
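The only throttling knobs we're aware of are the dynamic recovery/rebalance settings, e.g. something like the following (the values shown are, as far as we know, the 6.x defaults); is this the right lever, and what values would be sensible at this scale?

```json
PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "40mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2
  }
}
```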
If excluding all 30 nodes in one go is inadvisable and we have to decommission them in stages, say 10 at a time, is there a way to make the data migrate only to the 30 nodes that will remain operational? For example, when we decommission nodes 51-60, we don't want their data to migrate to nodes 31-50, because it would then be migrated multiple times as we subsequently decommission nodes 41-50 and finally nodes 31-40.
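Concretely, for the first batch we'd issue something like this (node names hypothetical), since the exclude filter accepts a comma-separated list, but as far as we can tell nothing in this call stops the relocated shards from landing on nodes 31-50:

```json
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "es-data-51,es-data-52,es-data-53,es-data-54,es-data-55,es-data-56,es-data-57,es-data-58,es-data-59,es-data-60"
  }
}
```

Would listing all of nodes 31-60 in the exclude filter up front, while only taking 10 machines offline at a time, achieve what we want?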
We use Elasticsearch version 6.7.1 btw.