Cluster Health Yellow - "last_allocation_status" : "no_attempt"

Hi, we run several large Elasticsearch Clusters (v8.12.1) in Kubernetes (AWS EKS). We've had several clusters fail to assign a single replica shard during rolling restarts of datanodes. (There was a similar post for v8.8.1. We are running a version already including the NPE fix mentioned in that post. I will post a link in the next message, as I've hit the per-post link limit.)

We see this occur occasionally when we perform rolling restarts of datanodes. Most restarts are successful, but in rare cases a cluster gets into a state where a single replica shard remains unassigned and Elasticsearch makes no attempt to assign it. We are running with index.unassigned.node_left.delayed_timeout set to 30m. When we hit this case, we see a datanode leave the cluster and rejoin normally, but 1 shard remains delayed_unassigned and eventually becomes unassigned.
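
For reference, index.unassigned.node_left.delayed_timeout can be applied with a plain index settings update; this is a sketch only (host and index pattern are placeholders, not necessarily how we manage it):

# Hypothetical sketch: set a 30m delayed allocation timeout on all indices
$ curl -s -X PUT "localhost:9200/_all/_settings" \
    -H 'Content-Type: application/json' \
    -d '{ "settings": { "index.unassigned.node_left.delayed_timeout": "30m" } }'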

We gathered the following info from an impacted cluster for debugging. In this case, the relevant shard was previously hosted by the node with the name tracer-es-cell4-data-39.tracer-es-cell4-data.tracer.svc.cluster.local and ID QTMusz3rTiy-8qgwT81SJA.

  1. /_cluster/health?pretty - Cluster Health Yellow - "last_allocation_status" : "no_attempt" · GitHub
    • shows a single unassigned shard.
  2. /_cluster/allocation/explain?pretty - Cluster Health Yellow - "last_allocation_status" : "no_attempt" · GitHub
    • shows shard details, along with "last_allocation_status" : "no_attempt" and "can_allocate" : "yes"
  3. /_internal/desired_balance?pretty - Cluster Health Yellow - "last_allocation_status" : "no_attempt" · GitHub
    • shows 0 unassigned shards in the stats. Checking under the "routing_table.tracer-apm-span-2024-04-09.7" path shows the shard is in state UNASSIGNED
    • I did extract the per-node shard counts from the "nodes" path and found that 4521 shards were accounted for.
  4. /_cat/allocation?v - Cluster Health Yellow - "last_allocation_status" : "no_attempt" · GitHub
    • shows 1 unassigned shard and 4521 shards assigned to nodes
  5. /_cat/shards?v - Cluster Health Yellow - "last_allocation_status" : "no_attempt" · GitHub
    • shows index tracer-apm-span-2024-04-09, shard 7, replica unassigned
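
(For anyone wanting to collect the same data, equivalent curl commands are below; the host is a placeholder, and the file names for items 1 and 4 are illustrative, chosen only to match the numbering of the jq/grep commands later in this thread.)

$ curl -s "localhost:9200/_cluster/health?pretty"             > 1_cluster_health.json
$ curl -s "localhost:9200/_cluster/allocation/explain?pretty" > 2_cluster_allocation_explain.json
$ curl -s "localhost:9200/_internal/desired_balance?pretty"   > 3_desired_balance.json
$ curl -s "localhost:9200/_cat/allocation?v"                  > 4_cat_allocation.txt
$ curl -s "localhost:9200/_cat/shards?v"                      > 5_cat_shards.json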

From reading the other post mentioned above, we tried to reset the desired balance by issuing a DELETE /_internal/desired_balance. We found this did result in the shard being assigned and cluster health returning to green. We did not attempt to restart the active master. We also restarted both the old datanode hosting the shard and the suggested target node, which had no effect.
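
(For completeness, the reset was just a plain DELETE against the internal endpoint, along these lines; host is a placeholder.)

# Discards the current desired balance; the master recomputes it from the current allocation
$ curl -s -X DELETE "localhost:9200/_internal/desired_balance"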

Other details that may potentially be relevant:

  • There are several active indices on these clusters for different use cases, but this seems to always occur on the index for the same use case. This index is relatively small, but sees a lot of updates to existing documents.
  • The clusters have 75 datanodes and this index runs with 10 shards and 2 replicas (1 primary and 2 replica copies per shard). This index has index.routing.allocation.total_shards_per_node set to 1 (a sketch of these settings follows this list).
  • We have not seen this impact any primary shards.
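
For context, a sketch of the relevant settings as they would appear at index creation (values taken from the details above; not a dump of our actual template):

$ curl -s -X PUT "localhost:9200/tracer-apm-span-2024-04-09" \
    -H 'Content-Type: application/json' \
    -d '{
          "settings": {
            "index.number_of_shards": 10,
            "index.number_of_replicas": 2,
            "index.routing.allocation.total_shards_per_node": 1
          }
        }'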

Thanks for any help.

The similar related post - Elasticsearch Cluster Yellow - Index Allocation "No Attempt"

The issue is that the unassigned shard wants to be on one of these nodes:

$ cat 3_desired_balance.json | jq '.routing_table["tracer-apm-span-2024-04-09"]["7"].desired.node_ids[]' -cMr
O2wmx_FXRdyo5kisO0kd2A
i4UC8i55QRGnKtYrdyBZjg
PqpNo7roSv2I_6EhoyygzQ

but all of those nodes are currently unavailable for one reason or another:

$ cat 2_cluster_allocation_explain.json | jq '.node_allocation_decisions[] | select(.node_id == "O2wmx_FXRdyo5kisO0kd2A" or .node_id == "i4UC8i55QRGnKtYrdyBZjg" or .node_id == "PqpNo7roSv2I_6EhoyygzQ") | .deciders[]'
{
  "decider": "same_shard",
  "decision": "NO",
  "explanation": "a copy of this shard is already allocated to this node [[tracer-apm-span-2024-04-09][7], node[i4UC8i55QRGnKtYrdyBZjg], [P], s[STARTED], a[id=bZqqQzK-RDun0X3ClCeHog], failed_attempts[0]]"
}
{
  "decider": "shards_limit",
  "decision": "NO",
  "explanation": "too many shards [1] allocated to this node for index [tracer-apm-span-2024-04-09], index setting [index.routing.allocation.total_shards_per_node=1]"
}
{
  "decider": "awareness",
  "decision": "NO",
  "explanation": "there are [3] copies of this shard and [3] values for attribute [zone] ([us-east-2a, us-east-2b, us-east-2c] from nodes in the cluster and no forced awareness) so there may be at most [1] copies of this shard allocated to nodes with each value, but (including this copy) there would be [2] copies allocated to nodes with [node.attr.zone: us-east-2c]"
}
{
  "decider": "awareness",
  "decision": "NO",
  "explanation": "there are [3] copies of this shard and [3] values for attribute [zone] ([us-east-2a, us-east-2b, us-east-2c] from nodes in the cluster and no forced awareness) so there may be at most [1] copies of this shard allocated to nodes with each value, but (including this copy) there would be [2] copies allocated to nodes with [node.attr.zone: us-east-2b]"
}
{
  "decider": "shards_limit",
  "decision": "NO",
  "explanation": "too many shards [1] allocated to this node for index [tracer-apm-span-2024-04-09], index setting [index.routing.allocation.total_shards_per_node=1]"
}

Fundamentally this is a limitation of setting such a restrictive index.routing.allocation.total_shards_per_node, see these docs:

WARNING: These settings impose a hard limit which can result in some shards not being allocated.

However I suspect you could work around this by setting cluster.routing.allocation.allow_rebalance: always (docs) which would permit Elasticsearch to move some of the other shards around to make room for the unassigned one.
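
For example, something along these lines (persistent vs transient is up to you):

$ curl -s -X PUT "localhost:9200/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{ "persistent": { "cluster.routing.allocation.allow_rebalance": "always" } }'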

Thanks @DavidTurner for these insights and the documentation, it is very helpful for understanding why the shard was not allocated. I do have some follow-up questions.

We use shard allocation awareness (docs) with a custom attribute holding the AZ of the datanode, in order to ensure that the 3 copies of each shard are spread across 3 AZs.
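
(Each datanode carries a node.attr.zone value and awareness is enabled on that attribute; the assignment can be spot-checked with _cat/nodeattrs, a sketch with output trimmed:)

$ curl -s "localhost:9200/_cat/nodeattrs?v&h=node,attr,value" | grep zone | head -3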

We are running with a restrictive index.routing.allocation.total_shards_per_node setting of 1 in order to spread load across our datanodes and prevent hot spotting.

Our cluster consists of 75 datanodes spread equally across 3 AZs:

$ cat 2_cluster_allocation_explain.json | jq '.node_allocation_decisions[].node_attributes.zone' | sort | uniq -c
  25 "us-east-2a"
  25 "us-east-2b"
  25 "us-east-2c"

You can understand why we did not land on the conclusion that this was due to the index.routing.allocation.total_shards_per_node setting, as there should be sufficient resources to allocate all shards for this index.

  • This index is configured with 10 shards and 2 replicas, so there are 30 total shards to allocate.
  • With shard allocation awareness, there are 10 shards to assign in each AZ.
  • There are 25 datanodes in each AZ.
  • 10 < 25, so we are guaranteed that there is a valid configuration in which all shards can be allocated.

From a user's perspective (ignorant of the internal interactions between the responses from _cluster/allocation/explain and _internal/desired_balance), this is very confusing and appears to be improper behavior:

  1. The response from _cluster/allocation/explain clearly states the shard can be allocated and suggests a valid target node, and yet the shard remains unassigned. This begs the question "Why is this shard not being allocated to the target node?". We can issue an allocate_replica command with the suggested target node to the _cluster/reroute API (sketched below, after this list) and get the shard allocated... why didn't the cluster do that automatically?
  2. Why is the _internal/desired_balance endpoint suggesting a configuration that is invalid? Is it because the code that determined the desired balance configuration is ignorant of restrictions on the shard allocation? If this is the case, then I would suggest an enhancement to ensure that generated configurations are valid before accepting them as the desired state.
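
For reference, a sketch of such a reroute command (the node name would be the target suggested by the explain output; placeholder here):

$ curl -s -X POST "localhost:9200/_cluster/reroute?pretty" \
    -H 'Content-Type: application/json' \
    -d '{
          "commands": [
            { "allocate_replica": { "index": "tracer-apm-span-2024-04-09", "shard": 7, "node": "<suggested-target-node>" } }
          ]
        }'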

I understand the complexities in finding an optimal balance in a large cluster state, but I would think the priority should be first to allocate all shards, and only then to move toward an optimally balanced configuration. Surely a green, unbalanced cluster should be preferred to a yellow, unbalanced cluster?

That's true for allocating primaries but not so much for replicas. I mean I understand what you're saying, but in situations where there's 10k+ (50k+?) missing shards it's definitely worse to allocate them all eagerly to the wrong nodes and then shuffle them around than it is to hold back a bit and put them in the right places to start with. It's basically impossible to get the allocation heuristics to do the obviously sensible thing in all situations.

#98710 would fix the confusing response from GET _cluster/allocation/explain in this case; we'll get around to that at some point.

The desired state isn't invalid, it's just that the default for cluster.routing.allocation.allow_rebalance gets in the way when total_shards_per_node is in play. TBH cluster.routing.allocation.allow_rebalance: always makes a lot more sense in recent versions, we might even make it the default eventually.

Thanks @DavidTurner for the clarification, I think I understand now. The suggested desired balance for this index is a valid state. I see that there is no overlap in the desired nodes for each shard (and I presume the three desired nodes for each shard are AZ balanced). So the index.routing.allocation.total_shards_per_node setting of 1 is being respected.

$ cat 3_desired_balance.json | jq '.routing_table["tracer-apm-span-2024-04-09"][].desired.node_ids[]' -r | sort | uniq -c
   1 4pD-EOxzQv-r_-P3V_6LpA
   1 5HeUkcRHQIK7DH3NnfGqjg
   1 8O6sPjGPTpqXCicSo3p1pw
   1 BEpTes9XT-aEQrekQbxksg
   1 BFFD276bTZ2tOjhH4zUQIA
   1 DZJmLId7TyGs817VRd-wUg
   1 HXTx_YD9RvGNRqkue93wpg
   1 JUWiGWrcRvC2w-XEwB8JkQ
   1 KtauELCVSPmHZ1Pd8XKdyA
   1 LwC7KrgzQ1ikoUYu5lVCsQ
   1 MooHRg92QfKg168KUD2GRQ
   1 O2wmx_FXRdyo5kisO0kd2A
   1 PqpNo7roSv2I_6EhoyygzQ
   1 Rv_B8EfYSGWVbdC-CQC4Ow
   1 S_Y7TKIDSw2uP18Duvtspg
   1 Sdd0ZKzpTv6KL8CzFIcNUQ
   1 TVlcuLvmSqiS2BPplMnCrg
   1 YGfxG-nKSUafJzX_9zFZ9A
   1 fkFei9y4RASkHH2NECChVA
   1 h2ZwKUeVSdqhZLrUHCRg9Q
   1 i4UC8i55QRGnKtYrdyBZjg
   1 i4p9WFgKQV6vudWRsB7jvA
   1 k4lIzuEMREy6enrFQT904g
   1 o5j7ul1zROyHWXh7Cu38KA
   1 uURkeZmKQdWnHShw8GAWQA
   1 udbAp7PkRkK5qdqBbGLmNg
   1 w9nU8I24RK-ldjm-pFD-zA
   1 xHGrdW53QVyCLNUYnwThxg
   1 yf-maW5hSpyDqQFlouI16w
   1 zNTZMJCcROK3bPWSAnv4YQ

The problem is that the cluster has no way to allocate the unassigned shard to any of its three desired nodes, as each of them already holds another shard of the index.

We can check what shard is "in the way":

$ cat 5_cat_shards.json | grep tracer-es-cell4-data-24.tracer-es-cell4-data.tracer.svc.cluster.local  | grep tracer-apm-span-2024-04-09
tracer-apm-span-2024-04-09                                    4     p      STARTED        1631  13.2mb  13.2mb 10.69.18.9    tracer-es-cell4-data-24.tracer-es-cell4-data.tracer.svc.cluster.local

From the suggested desired balance, this shard is not on one of its desired nodes. However, the cluster cannot move it because, under the default settings, rebalancing is not allowed while any shard (primary or replica) is unassigned. Hence we reach a deadlock state.
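
(The setting responsible can be confirmed against the live cluster; a sketch, host placeholder:)

# Default is "indices_all_active": no rebalancing while any shard copy is unassigned
$ curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep allow_rebalance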

Setting cluster.routing.allocation.allow_rebalance: always would break the deadlock by allowing the "in the way" shard to be moved to one of its desired nodes.

I just have a small concern regarding the default cluster.routing.allocation.node_concurrent_recoveries: 2 and cluster.routing.allocation.cluster_concurrent_rebalance: 2 settings (which we are using). I want to confirm that the allocation of the unassigned shard would qualify only as a "recovery", while the newly allowed rebalancing movements would qualify as a "rebalance" (and a "recovery"). In other words, the recovery of the unassigned shard would only be blocked if 2 concurrent recoveries were already in progress on the specific desired node, and NOT if there were 2 concurrent rebalances ongoing anywhere in the cluster. (My concern is that a long queue of rebalancing operations could block recovery of unassigned shards.)
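
(For reference, both of those limits are dynamic, so they could be adjusted via the cluster settings API if rebalancing traffic ever became a problem; a sketch with illustrative values, not something we have applied:)

$ curl -s -X PUT "localhost:9200/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{ "persistent": { "cluster.routing.allocation.node_concurrent_recoveries": 4, "cluster.routing.allocation.cluster_concurrent_rebalance": 4 } }'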

That's correct. In fact even if there were lots of shard movements involving the specific desired node it still would prefer to assign the unassigned shard as soon as it could. The order of precedence is:

  1. assigning unassigned shards
  2. shard movements which are forced by allocation rules
  3. rebalancing movements, i.e. unforced moves to put shards in a more desirable location

Excellent! Thanks for your stellar support here, this was very helpful and I am confident we have the info we need to resolve this.

As a closing thought, I would just re-iterate from an operator's perspective that the discrepancy between the target node in the response from _cluster/allocation/explain and the desired nodes in the response from _internal/desired_balance is incredibly confusing. One response says "I can allocate this shard on node A, but I'm refraining from doing so (with no reason given)" and the other says "Nodes X, Y, Z are desired for this shard".

I am entirely ignorant of the internal interactions and complexities in the code that generates these two responses, but I would think that the suggested target node should only ever be one of X, Y, Z. If the cluster isn't going to attempt to allocate the shard to A, then it shouldn't suggest A as a target node.

100% agree. #98710 will fix that.