Balance disk usage across warm nodes

Elasticsearch: 9.0.3 (ECK managed)

  1. Kubernetes: AKS

  2. Topology: 2 hot data nodes, 2 warm data nodes, 3 master nodes

  3. Storage: persistent volumes per data node

  4. Workload: APM traces (plus APM logs/metrics); data streams with rollover (~5 GB / ~8 h)

  5. ILM: hot → warm (after ~10 days), then delete after 180 days

  6. Replicas: currently 0 on hot; set to 0 on warm temporarily to reduce pressure

    After setting replicas=0 to stabilize things, one warm node’s disk keeps filling up much more than the other’s:

    es-warm-0: disk.total 393.1gb, disk.used 343.2gb, disk.avail 49.9gb
    es-warm-1: disk.total 393.1gb, disk.used 298.9gb, disk.avail 94.1gb
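
    (Per-node figures like these can be pulled from the cat allocation API, for example; the column list and sort are optional:)

    GET _cat/allocation?v&h=node,shards,disk.total,disk.used,disk.avail&s=disk.avail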

    What’s the recommended way to make the allocator prioritize disk usage so warm nodes converge on similar free space?

How I resolved it

I managed to balance disk usage across my nodes without overloading the JVM or hitting circuit breakers. Here’s what I did step by step:

  1. Throttle relocations first (avoid heap spikes/circuit breakers during moves)
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "1",
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": "1",
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": "1",
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}
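
While the throttles are in place, relocation progress can be watched with the cat recovery API (the column list here is just a convenient subset):
GET _cat/recovery?v&active_only=true&h=index,shard,source_node,target_node,stage,bytes_percent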
  2. Use absolute disk watermarks (react before disks are critically full)
    Adjust the GB values to your disk size; with absolute values, each watermark is a threshold of free space remaining on a node.
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low":  "25gb",
    "cluster.routing.allocation.disk.watermark.high": "20gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb"
  }
}
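
If a shard still refuses to move off the fuller node, the cluster allocation explain API shows which rule (including these watermarks) is blocking it. The index name below is a placeholder:
# "my-backing-index" is a placeholder; substitute a real backing index name from GET _cat/shards
GET _cluster/allocation/explain
{
  "index": "my-backing-index",
  "shard": 0,
  "primary": true
}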
  3. Balance by disk usage and shard count (keep free space and shard counts close)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.disk_usage": "0.60",
    "cluster.routing.allocation.balance.shard":      "0.35",
    "cluster.routing.allocation.balance.index":      "0.05"
  }
}
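
Optional follow-up: once disk usage has converged, the transient throttles from step 1 can be reset to their defaults by setting them to null:
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null
  }
}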

Result: disk usage is now more even across nodes, shard counts per node are close, and JVM stays stable (no circuit breaker trips during rebalancing).
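
One way to keep an eye on it going forward is a single cat nodes call showing roles, heap, and disk headroom per node:
GET _cat/nodes?v&h=name,node.role,heap.percent,disk.used_percent&s=name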