Multi-node shrink prototype

I've seen issues caused by index shrink when running on larger clusters. The root cause tends to be the requirement for all shards to be recovered onto a single node: this can create significant hotspots in the cluster, requires a lot of (temporary) disk and memory on that node, and has caused a couple of significant outages for our product in the past. In most cases the best way to address this has been to effectively dedicate nodes to performing the shrink and recovery, to avoid any impact on normal search and indexing operations (along with tuning index recovery bandwidth etc., with limited success).
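For context, the dedicated-node workaround boils down to pinning the source index to the shrink node with an allocation filter, blocking writes, and then calling the _shrink API once relocation finishes. Below is a minimal sketch of that flow using the low-level Java REST client; the index name, target index name, and node name are placeholders, and error handling and waiting for relocation to complete are omitted.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class SingleNodeShrinkExample {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {

            // Step 1: force every shard copy of the source index onto the dedicated
            // shrink node and block writes. This relocation is what creates the hotspot.
            Request prepare = new Request("PUT", "/logs-000001/_settings");
            prepare.setJsonEntity("{"
                + "\"index.routing.allocation.require._name\": \"shrink-node-1\","
                + "\"index.blocks.write\": true"
                + "}");
            client.performRequest(prepare);

            // Step 2: once all shards have relocated, shrink into a new index with
            // fewer primary shards, clearing the allocation filter and write block
            // on the target so it can be balanced normally afterwards.
            Request shrink = new Request("POST", "/logs-000001/_shrink/logs-000001-shrunk");
            shrink.setJsonEntity("{"
                + "\"settings\": {"
                + "  \"index.number_of_shards\": 1,"
                + "  \"index.routing.allocation.require._name\": null,"
                + "  \"index.blocks.write\": null"
                + "}}");
            client.performRequest(shrink);
        }
    }
}
```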

I spent some time looking at the code for this and thinking about ways it could be improved. The shard recovery itself already works well when recovering an individual shard, but allocating specific shards to specific nodes is not possible with the existing allocation filters.

My idea is to create an allocation decider that lets you define a group of node IDs and then only allows allocation of (primary) shard shard_id to node_group[shard_id % len(node_group)]. Combined with a small adjustment to which source shards get selected for a shrunk shard, this lets you define the shrink operation in terms of the locations of the target shards and so spread the recovery across all of the nodes in the group.
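To make the idea concrete, here is a rough sketch of what such a decider could look like. This is not working code from the prototype: the class name, the way the node group is supplied, and the hypothetical setting mentioned in the comments are all assumptions, and a real implementation would need to plug into cluster/index settings and handle replicas, missing nodes, and so on.

```java
import java.util.List;

import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

/**
 * Sketch of an allocation decider that spreads the primary shards of an index
 * across a fixed group of nodes: primary shard i may only be allocated to
 * nodeGroup.get(i % nodeGroup.size()). Wiring the node group up to an index
 * setting, and relaxing the rule for replicas, is left out of this sketch.
 */
public class ShrinkNodeGroupAllocationDecider extends AllocationDecider {

    public static final String NAME = "shrink_node_group";

    // In a real implementation this would come from an index-level setting,
    // e.g. something like "index.routing.allocation.shrink_node_group" (hypothetical).
    private final List<String> nodeGroup;

    public ShrinkNodeGroupAllocationDecider(List<String> nodeGroup) {
        this.nodeGroup = List.copyOf(nodeGroup);
    }

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        if (nodeGroup.isEmpty() || shardRouting.primary() == false) {
            return allocation.decision(Decision.YES, NAME, "no node group configured or shard is a replica");
        }
        int shardId = shardRouting.shardId().id();
        String expectedNodeId = nodeGroup.get(shardId % nodeGroup.size());
        if (expectedNodeId.equals(node.nodeId())) {
            return allocation.decision(Decision.YES, NAME,
                "primary shard [%d] maps to node [%s]", shardId, expectedNodeId);
        }
        return allocation.decision(Decision.NO, NAME,
            "primary shard [%d] must be allocated to node [%s]", shardId, expectedNodeId);
    }
}
```

Using shard_id modulo the group size keeps the mapping deterministic, so each node in the group ends up recovering roughly an equal share of the target shards; the remaining piece would be aligning the source-shard selection for each shrunk shard with that mapping.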

Before I go too far with implementing this properly (adding tests, integrating with ILM and existing settings, etc.), I wanted to get some opinions on this approach, to see whether it's a worthwhile change and whether there's anything I've missed in my thought process.

Thanks for the question @tobyb121, and for your interest in fixing this longstanding issue in ES. In practice I think we'd rather take the approach suggested in #63519, which would avoid the need to relocate shards at all. We'd also want to integrate this with #73496 so that we can use data held in snapshots as the source of most of the data needed for the shrink, rather than copying it between nodes.

In fact, another possible approach might be to shrink directly from a snapshot. I think it wouldn't be very hard to adjust the restore process to support this feature. It would perhaps be trickier to integrate with ILM, though, because it'd require a good snapshot before doing the shrink. I haven't completely thought through the consequences of this idea, just throwing it out there.

I had a brief conversation with some of the folks responsible for ILM, and they think this would not be too tricky (today's shrink is already fairly tricky for ILM to do; this idea would be better).

Thanks, I hadn't seen that issue, but remote recovery certainly makes sense. I'll take a look at whether this is something I can make some progress on.