Elasticsearch Cluster Yellow - Index Allocation "No Attempt"

We are running several Elasticsearch clusters (v8.8.1) in Kubernetes (AWS EKS on v1.25) via Elastic Cloud on Kubernetes (ECK v2.8).
After a high-CPU-load event on the K8s worker nodes, several of the clusters have been left with some primary shards unallocated.

Running _cluster/allocation/explain?pretty against one of the unassigned primary shards gives us:
"last_allocation_status": "no_attempt"
"can_allocate": "yes",
"allocate_explanation": "Elasticsearch can allocate the shard.",

However, Elasticsearch never actually goes on to attempt the allocation.
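For reference, the explain request we are running looks roughly like this (the index name is a placeholder for one of the affected indices):

GET _cluster/allocation/explain?pretty
{
  "index": "<affected-index>",
  "shard": 0,
  "primary": true
}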

Cluster health is otherwise fine: there is plenty of disk space, and the indices affected so far all belong on the hot nodes.
The cluster in this particular example is set up as follows:

  • 3 combination hot/master nodes
  • 3 combination warm/master nodes
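Tier membership comes from node.roles on each ECK nodeSet; a quick way to confirm which roles each node actually carries is something like:

GET _cat/nodes?v&h=name,node.role,master

The corresponding _cluster/health output is: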
{
  "cluster_name": "<REDACTED>",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 6,
  "active_primary_shards": 788,
  "active_shards": 1577,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 2,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.87333755541482
}

I have not been able to find a way to get Elasticsearch to actually allocate these shards.
I have been working around it with _cluster/reroute and the allocate_empty_primary command, but that isn't ideal as it likely results in data loss. So far the affected indices haven't been critically important.

Running _cluster/reroute?retry_failed=true doesn't seem to do anything either, since the allocation hasn't "failed"; it was simply never attempted.
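For reference, the two reroute variants we've tried look roughly like this (the index, shard and node values are placeholders; allocate_empty_primary also requires accept_data_loss to be set explicitly):

POST _cluster/reroute?retry_failed=true

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "<affected-index>",
        "shard": 0,
        "node": "<hot-node-name>",
        "accept_data_loss": true
      }
    }
  ]
}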

Any assistance with this would be greatly appreciated.

Could you share the full output of GET _cluster/allocation/explain and also GET _internal/desired_balance? Use something like https://gist.github.com if they're too long to share here.


Hi David,

GET _internal/desired_balance returns a 500 error.
Below are both that error message and the allocation explain output for one of the indices in question.

{
  "error": {
    "root_cause": [
      {
        "type": "null_pointer_exception",
        "reason": "Cannot invoke \"org.elasticsearch.action.admin.cluster.allocation.DesiredBalanceResponse$ShardAssignmentView.writeTo(org.elasticsearch.common.io.stream.StreamOutput)\" because \"this.desired\" is null"
      }
    ],
    "type": "null_pointer_exception",
    "reason": "Cannot invoke \"org.elasticsearch.action.admin.cluster.allocation.DesiredBalanceResponse$ShardAssignmentView.writeTo(org.elasticsearch.common.io.stream.StreamOutput)\" because \"this.desired\" is null"
  },
  "status": 500
}
{
  "index": ".ds-logs-ti_misp.threat-default-2023.07.18-000008",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "INDEX_CREATED",
    "at": "2023-07-18T06:04:05.703Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "yes",
  "allocate_explanation": "Elasticsearch can allocate the shard.",
  "target_node": {
    "id": "UiIl6H2FQrqSUThr1ZQtRg",
    "name": "client-es-masters-hot-1",
    "transport_address": "172.16.195.84:9300",
    "attributes": {
      "ml.allocated_processors": "8",
      "ml.max_jvm_size": "4294967296",
      "xpack.installed": "true",
      "ml.machine_memory": "6442450944",
      "ml.allocated_processors_double": "8.0",
      "k8s_node_name": "ip-10-0-68-59.ap-southeast-2.compute.internal"
    }
  },
  "node_allocation_decisions": [
    {
      "node_id": "UiIl6H2FQrqSUThr1ZQtRg",
      "node_name": "client-es-masters-hot-1",
      "transport_address": "172.16.195.84:9300",
      "node_attributes": {
        "ml.allocated_processors": "8",
        "ml.max_jvm_size": "4294967296",
        "xpack.installed": "true",
        "ml.machine_memory": "6442450944",
        "ml.allocated_processors_double": "8.0",
        "k8s_node_name": "ip-10-0-68-59.ap-southeast-2.compute.internal"
      },
      "node_decision": "yes",
      "weight_ranking": 4
    },
    {
      "node_id": "30V_RblrRSaYKprD4iJ5og",
      "node_name": "client-es-masters-hot-0",
      "transport_address": "172.16.10.137:9300",
      "node_attributes": {
        "ml.max_jvm_size": "4294967296",
        "xpack.installed": "true",
        "ml.machine_memory": "6442450944",
        "ml.allocated_processors_double": "8.0",
        "ml.allocated_processors": "8",
        "k8s_node_name": "ip-10-0-115-74.ap-southeast-2.compute.internal"
      },
      "node_decision": "yes",
      "weight_ranking": 5
    },
    {
      "node_id": "XfjyU7pGSVOFmjqKJAngyA",
      "node_name": "client-es-masters-hot-2",
      "transport_address": "172.16.235.221:9300",
      "node_attributes": {
        "xpack.installed": "true",
        "ml.machine_memory": "6442450944",
        "ml.allocated_processors_double": "8.0",
        "ml.max_jvm_size": "4294967296",
        "ml.allocated_processors": "8",
        "k8s_node_name": "ip-10-0-82-81.ap-southeast-2.compute.internal"
      },
      "node_decision": "yes",
      "weight_ranking": 6
    },
    {
      "node_id": "HnfWmpd3Ria9xfvH5Y88FQ",
      "node_name": "client-es-masters-warm-ebs-2",
      "transport_address": "172.16.175.150:9300",
      "node_attributes": {
        "k8s_node_name": "ip-10-0-64-101.ap-southeast-2.compute.internal",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_hot] and node does not meet the required [data_hot] tier"
        }
      ]
    },
    {
      "node_id": "sCHJV5mmQrCCcVAavWc-qg",
      "node_name": "client-es-masters-warm-ebs-0",
      "transport_address": "172.16.213.107:9300",
      "node_attributes": {
        "k8s_node_name": "ip-10-0-117-195.ap-southeast-2.compute.internal",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 2,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_hot] and node does not meet the required [data_hot] tier"
        }
      ]
    },
    {
      "node_id": "CdzzX7_5RvONdkjWU10T2w",
      "node_name": "client-es-masters-warm-ebs-1",
      "transport_address": "172.16.69.250:9300",
      "node_attributes": {
        "k8s_node_name": "ip-10-0-80-196.ap-southeast-2.compute.internal",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_hot] and node does not meet the required [data_hot] tier"
        }
      ]
    }
  ]
}

Oh, hmm, that error is not helpful at all. I opened Fix NPE in Desired Balance API by DaveCTurner · Pull Request #97775 · elastic/elasticsearch · GitHub to fix it.

Is there anything in the logs to suggest that something related to allocation has unrecoverably failed? If you do something to disturb the allocation (e.g. create another new index) does it resolve the situation? Can you try DELETE /_internal/desired_balance? If none of that works, could you trigger a master failover (e.g. temporarily disconnect the current master)?
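Concretely, those suggestions amount to something like the sketch below (the index name is just a throwaway placeholder):

# 1. Disturb the allocator by creating a new, throwaway index
PUT /allocation-nudge-test

# 2. Reset the desired-balance computation
DELETE /_internal/desired_balance

# 3. If neither helps, force a master failover, e.g. by restarting or
#    temporarily disconnecting the pod hosting the current elected master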

I would not expect allocate_empty_primary to help, but I believe this index is brand-new and was never allocated, so it also won't lose anything.

Hi David,

There wasn't anything specific in the logs aside from periodic errors and stack traces about the null pointer exception.
Trying the DELETE as suggested didn't seem to do anything except return the same 500 error.
Attempting to create a new index likewise made no difference, except that its shards also went unallocated.

However, what did resolve the situation was killing one of the master pods in Kubernetes and letting it re-create. That got the desired_balance endpoint returning a valid response again, and then all of the previously unassigned shards were successfully routed and assigned by the cluster.
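For completeness, once the replacement master pod had rejoined, recovery was confirmed simply by re-running the earlier calls:

GET _internal/desired_balance

GET _cluster/health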

Thanks for your assistance with this. We can call this one solved.

Glad to hear you're up and running again. I'm not calling it "solved" so much as "worked around"; it's definitely a bug to get into this state in the first place. We'll continue trying to work out how that happened.
