Cluster keeps getting into yellow state and hitting throttled for initializing shards (max rebalance)

Hi,

we have a cluster with this info:

```
{
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 59,
  "number_of_data_nodes": 54,
  "active_primary_shards": 6205,
  "active_shards": 12409,
  "relocating_shards": 7,
  "initializing_shards": 0,
  "unassigned_shards": 1,
  "unassigned_primary_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.99194198227237
}
```

  "routing": {

    "allocation": {

"cluster_concurrent_rebalance": "2"

    }

  },

"max_shards_per_node": "3000"

},

"search": {

"default_search_timeout": "5m",

"max_async_search_response_size": "50mb"

}

},

"transient": {

"cluster": {

"routing": {

"allocation": {

"cluster_concurrent_rebalance": "5"

    }

23 of those 54 data nodes are hot nodes. Hot nodes have 2 TB of disk, cold nodes 12 TB.

The issue is that lately the cluster goes yellow quite regularly; it is constantly rebalancing with 5 concurrent moves, and initializing shards sit waiting for a rebalance slot to open up, for example:

"can_remain_on_current_node": "yes",

"can_rebalance_cluster": "throttled",

"can_rebalance_cluster_decisions": [

{

"decider": "concurrent_rebalance",

"decision": "THROTTLE",

"explanation": "reached the limit of concurrently rebalancing shards [8], cluster setting [cluster.routing.allocation.cluster_concurrent_rebalance=5]"

}

],

"can_rebalance_to_other_node": "throttled",

"rebalance_explanation": "Elasticsearch is currently busy with other activities. It will rebalance this shard when those activities finish. Please wait.",

"node_allocation_decisions": [

Disk usage for hot nodes: 30% of them are above 90% used, and 60% are above 80%.

Not sure if this behaviour is caused by the low free space on our hot nodes, or whether we should change the current rebalance value to better fit our needs. As far as we understand, a shard shouldn't sit idle waiting like this for so long (> 30 min).
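In case it helps, this is how we look at per-node disk pressure on the hot tier, using the standard `_cat/allocation` endpoint (the column selection here is just what we find useful, not anything special):

```
GET _cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent&s=disk.percent:desc
```

Sorting by `disk.percent` descending puts the most-full nodes (the ones triggering watermark-driven moves) at the top.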

Thanks

Which version of Elasticsearch are you using?

This isn’t the explanation for the unassigned shard. You need to look at the right shard to understand why it’s not assigned.
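By default `_cluster/allocation/explain` picks an arbitrary unassigned shard. To explain a specific shard, pass it explicitly in the request body (index name, shard number, and `primary` flag below are placeholders for the shard you care about):

```
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}
```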

The docs for this setting say

> Increasing this setting may cause the cluster to use additional resources moving shards between nodes, so we generally do not recommend adjusting this setting from its default of 2.


Hi, thanks for the response

version is 8.19.3

now that I checked again I see 3 unassigned:

```
.ds-logstash-nginx_access-default-2026.03.11-000043  1  r  UNASSIGNED  INDEX_CREATED
.ds-logstash-nginx_access-default-2026.03.11-000043  2  r  UNASSIGNED  INDEX_CREATED
.ds-logstash-nginx_access-default-2026.03.11-000043  3  r  UNASSIGNED  INDEX_CREATED
```

the explanation is just no attempt:

  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "INDEX_CREATED",
    "at": "2026-03-11T10:10:38.556Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "yes",

When I check the GET /_cat/recovery?v&active_only=true API, it shows 5 shards moving with type "peer", and none of them are from today's date. I think that means it's just moving shards around because of the disk-based allocation policy.
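For reference, narrowing `_cat/recovery` to a few columns makes the ongoing moves easier to read (these are all standard column names for that endpoint):

```
GET _cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent,time
```

Type `peer` with both a source and a target node indicates a shard relocation or replica recovery over the network, as opposed to a local recovery from disk.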

some data for hot nodes:

Total free: ~15,088.7 GiB ≈ 14.73 TiB
Total capacity: 35 × 2.0 TiB = 70.0 TiB
Total used: ~55.27 TiB
Average usage: ~79.0%

We changed some of our retention policies to free up space, so the cluster is not moving as many shards around now.

You mention the recommended rebalance value of 2. Is that the same for all cluster sizes, or are there other considerations? Ours is a historic decision (from 7.x, I think): we increased the value to 5 IIRC because the cluster was usually quite badly balanced (due to disks). Not sure whether, after 8.x and the change to the balancing algorithm, this can be set back to 2.
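If we do try going back, I assume the cleanest way is to clear the transient override (setting it to `null` removes it), so the cluster falls back to the persistent value, which in our case is 2:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null
  }
}
```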

I think it's not. It would also say something like "Elasticsearch is currently busy with other activities. It expects to be able to allocate this shard when those activities finish. Please wait.", right? I expect it's busy because you're allowing it to do too much rebalancing at once.