How I resolved it
I managed to balance disk usage across my nodes without overloading the JVM or hitting circuit breakers. Here’s what I did step by step:
- Throttle relocations first (avoid heap spikes/circuit breakers during moves):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "1",
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": "1",
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": "1",
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}
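
With those throttles in place, it helps to watch the moves complete before changing anything else. A standard Console check (not part of the original steps) is:

```
GET _cat/recovery?v=true&active_only=true
```

An empty result means no shard relocations or recoveries are currently in flight, so the next settings change starts from a quiet cluster.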
- Use absolute disk watermarks (react before disks are critically full):

Absolute watermarks are minimum free space per node, so low > high > flood_stage. Adjust the GB values to your disk sizes.

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "25gb",
    "cluster.routing.allocation.disk.watermark.high": "20gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb"
  }
}
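
To pick sensible values for these watermarks (and to verify the effect afterwards), you can check per-node disk usage and shard counts with:

```
GET _cat/allocation?v=true
```

The `disk.avail` column shows how much free space each node has left, which is exactly what the absolute watermarks above compare against.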
- Balance by disk and shard count (keep free space and shard counts close):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.disk_usage": "0.60",
    "cluster.routing.allocation.balance.shard": "0.35",
    "cluster.routing.allocation.balance.index": "0.05"
  }
}
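
These three settings are relative weights in the balancer's scoring function, and the `disk_usage` weight only exists in newer Elasticsearch releases. Before overriding them, you can inspect the current and default values with:

```
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.balance*
```

If `balance.disk_usage` does not appear in the defaults, your version balances by shard and index count only, and the first setting above will be rejected.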
Result: disk usage is now noticeably more even across nodes, shard counts per node are close, and the JVM stays stable (no circuit breaker trips during rebalancing).
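
One follow-up worth considering: the transient throttles from the first step disappear on a full cluster restart, but if you apply the same approach you can also clear them explicitly once rebalancing settles, by setting each one to null:

```
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null
  }
}
```

This restores the default recovery concurrency and bandwidth, so routine shard movements aren't permanently slowed down.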