Cluster keeps getting into yellow state and hitting throttled for initializing shards (max rebalance)

Hi,

we have a cluster with this info:

```
{
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 59,
  "number_of_data_nodes": 54,
  "active_primary_shards": 6205,
  "active_shards": 12409,
  "relocating_shards": 7,
  "initializing_shards": 0,
  "unassigned_shards": 1,
  "unassigned_primary_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.99194198227237
}
```

  "routing": {

    "allocation": {

"cluster_concurrent_rebalance": "2"

    }

  },

"max_shards_per_node": "3000"

},

"search": {

"default_search_timeout": "5m",

"max_async_search_response_size": "50mb"

}

},

"transient": {

"cluster": {

"routing": {

"allocation": {

"cluster_concurrent_rebalance": "5"

    }

23 of those 54 data nodes are hot nodes. Hot nodes have 2 TB of disk, cold nodes 12 TB.

The issue is that lately the cluster goes yellow quite regularly; it is constantly rebalancing with 5 concurrent moves, and initializing shards sit waiting for a rebalance slot to open up, for example:

"can_remain_on_current_node": "yes",

"can_rebalance_cluster": "throttled",

"can_rebalance_cluster_decisions": [

{

"decider": "concurrent_rebalance",

"decision": "THROTTLE",

"explanation": "reached the limit of concurrently rebalancing shards [8], cluster setting [cluster.routing.allocation.cluster_concurrent_rebalance=5]"

}

],

"can_rebalance_to_other_node": "throttled",

"rebalance_explanation": "Elasticsearch is currently busy with other activities. It will rebalance this shard when those activities finish. Please wait.",

"node_allocation_decisions": [

Disk usage for hot nodes: 30% of them are above 90% used, and 60% are above 80%.

Not sure if this behaviour is caused by the low free space on our hot nodes, or whether we should change the current rebalance value to better fit our needs. As far as we understand, a shard shouldn't sit idle waiting like this for so long (> 30 min).
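In case it helps, this is how we look at per-node disk pressure on the hot tier, using the standard `_cat/allocation` endpoint (the column selection here is just what we find useful, not anything special):

```
GET _cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent&s=disk.percent:desc
```

Sorting by `disk.percent` descending puts the most-full nodes (the ones triggering watermark-driven moves) at the top.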

Thanks

Which version of Elasticsearch are you using?

This isn’t the explanation for the unassigned shard. You need to look at the right shard to understand why it’s not assigned.
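By default `_cluster/allocation/explain` picks an arbitrary unassigned shard. To explain a specific shard, pass it explicitly in the request body (index name, shard number, and `primary` flag below are placeholders for the shard you care about):

```
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}
```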

The docs for this setting say

> Increasing this setting may cause the cluster to use additional resources moving shards between nodes, so we generally do not recommend adjusting this setting from its default of 2.


Hi, thanks for the response

version is 8.19.3

now that I checked again I see 3 unassigned:

```
.ds-logstash-nginx_access-default-2026.03.11-000043  1  r  UNASSIGNED  INDEX_CREATED
.ds-logstash-nginx_access-default-2026.03.11-000043  2  r  UNASSIGNED  INDEX_CREATED
.ds-logstash-nginx_access-default-2026.03.11-000043  3  r  UNASSIGNED  INDEX_CREATED
```

the explanation is just no attempt:

  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "INDEX_CREATED",
    "at": "2026-03-11T10:10:38.556Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "yes",

When I check the GET /_cat/recovery?v&active_only=true API, it shows 5 shards moving with type "peer", and none of them are from today's date. I think that means it's just moving shards around because of the disk-based allocation policy.
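For reference, narrowing `_cat/recovery` to a few columns makes the ongoing moves easier to read (these are all standard column names for that endpoint):

```
GET _cat/recovery?v&active_only=true&h=index,shard,type,stage,source_node,target_node,bytes_percent,time
```

Type `peer` with both a source and a target node indicates a shard relocation or replica recovery over the network, as opposed to a local recovery from disk.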

some data for hot nodes:

Total free: ~15,088.7 GiB ≈ 14.73 TiB
Total capacity: 35 × 2.0 TiB = 70.0 TiB
Total used: ~55.27 TiB
Average usage: ~79.0%

We changed some of our retention policies to free up space, so the cluster is not moving as many shards around now.

You mention the recommended rebalance value of 2. Is that the same for all cluster sizes, or are there other considerations? Ours is a historic decision (from 7.x, I think): we increased the value to 5 IIRC because the cluster was usually quite badly balanced (due to disks). Not sure whether, after 8.x and the change to the balancing algorithm, this can be set back to 2.
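If we do try going back, I assume the cleanest way is to clear the transient override (setting it to `null` removes it), so the cluster falls back to the persistent value, which in our case is 2:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null
  }
}
```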

I think it's not. It would also say something like "Elasticsearch is currently busy with other activities. It expects to be able to allocate this shard when those activities finish. Please wait.", right? I expect it's busy because you're allowing it to do too much rebalancing at once.