I would like to add my solution to Moving shards is slow
The solution has two parts:
- Ensure the primaries are balanced over all the nodes so the recovery load is well distributed. This was done with an external shard balancer: It swaps primary shards with replicas (using a series of moves) until they are evenly distributed. This increases the total shard movement, but decreases recovery time significantly when it is needed.
- Recover only one-shard-per-node-at-a-time, only one-primary-per-node-at-a-time. Something terrible happens when you ask a node to recover/move more than one shard at a time. Maybe it is a classic "it takes twice as long to do twice as much", but it seems much worse than that. When shards must recover, we track which pair of nodes (source node and destination node) is involved and ensure those nodes are not already busy with another shard recovery. By blocking recoveries until the nodes are free, the overall recovery time is reduced significantly. This technique is also done with the same external shard balancer.
Using this technique, shard recovery can be reduced from a week (or never, in some pathological shard/node cases), to a couple of days. The cluster went from constantly yellow, to almost always green. Life is good now.
You can also turn off ingestion during recovery: It noticeably improves recovery time, but that is common knowledge.
Note: The cluster is about 40 EC2 nodes, 4K shards (2K primary shards), median shard size is about 20Gb, total primaries sum to 35Tb on disk.