Elasticsearch Cluster Yellow - Index Allocation "No Attempt"

We are running several Elasticsearch clusters (v8.8.1) in Kubernetes (AWS EKS on v1.25) via Elastic Cloud on Kubernetes (ECK v2.8).
After a high-CPU-load event on the K8s worker nodes, several of the clusters have been left with some primary shards unallocated.

Running _cluster/allocation/explain?pretty against one of the unassigned primary shards gives us:
"last_allocation_status": "no_attempt"
"can_allocate": "yes",
"allocate_explanation": "Elasticsearch can allocate the shard.",

However, Elasticsearch never actually goes on to attempt the allocation.
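For reference, the explain request we are running looks roughly like this (the index name is a placeholder for one of the affected indices):

GET _cluster/allocation/explain?pretty
{
  "index": "<affected-index>",
  "shard": 0,
  "primary": true
}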

Cluster health is otherwise fine: there is plenty of disk space, and the indices affected so far all belong on the hot nodes.
The cluster in this particular example is set up as follows:

  • 3 combination hot/master nodes
  • 3 combination warm/master nodes
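Tier membership comes from node.roles on each ECK nodeSet; a quick way to confirm which roles each node actually carries is something like:

GET _cat/nodes?v&h=name,node.role,master

The corresponding _cluster/health output is: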
{
  "cluster_name": "<REDACTED>",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 6,
  "active_primary_shards": 788,
  "active_shards": 1577,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 2,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.87333755541482
}

I have not been able to find a way to get Elasticsearch to actually allocate these shards.
I have been working around it with _cluster/reroute and the allocate_empty_primary command, but that isn't ideal as it likely results in data loss. So far the affected indices haven't been critically important.

Running _cluster/reroute?retry_failed=true doesn't seem to do anything either, since the allocation hasn't "failed"; it was simply never attempted.
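For reference, the two reroute variants we've tried look roughly like this (the index, shard and node values are placeholders; allocate_empty_primary also requires accept_data_loss to be set explicitly):

POST _cluster/reroute?retry_failed=true

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "<affected-index>",
        "shard": 0,
        "node": "<hot-node-name>",
        "accept_data_loss": true
      }
    }
  ]
}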

Any assistance with this would be greatly appreciated.

Could you share the full output of GET _cluster/allocation/explain and also GET _internal/desired_balance? Use something like https://gist.github.com if they're too long to share here.


Hi David,

GET _internal/desired_balance returns a 500 error.
Below are both that error message and the allocation explain output for one of the indices in question.

{
  "error": {
    "root_cause": [
      {
        "type": "null_pointer_exception",
        "reason": "Cannot invoke \"org.elasticsearch.action.admin.cluster.allocation.DesiredBalanceResponse$ShardAssignmentView.writeTo(org.elasticsearch.common.io.stream.StreamOutput)\" because \"this.desired\" is null"
      }
    ],
    "type": "null_pointer_exception",
    "reason": "Cannot invoke \"org.elasticsearch.action.admin.cluster.allocation.DesiredBalanceResponse$ShardAssignmentView.writeTo(org.elasticsearch.common.io.stream.StreamOutput)\" because \"this.desired\" is null"
  },
  "status": 500
}
{
  "index": ".ds-logs-ti_misp.threat-default-2023.07.18-000008",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "INDEX_CREATED",
    "at": "2023-07-18T06:04:05.703Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "yes",
  "allocate_explanation": "Elasticsearch can allocate the shard.",
  "target_node": {
    "id": "UiIl6H2FQrqSUThr1ZQtRg",
    "name": "client-es-masters-hot-1",
    "transport_address": "172.16.195.84:9300",
    "attributes": {
      "ml.allocated_processors": "8",
      "ml.max_jvm_size": "4294967296",
      "xpack.installed": "true",
      "ml.machine_memory": "6442450944",
      "ml.allocated_processors_double": "8.0",
      "k8s_node_name": "ip-10-0-68-59.ap-southeast-2.compute.internal"
    }
  },
  "node_allocation_decisions": [
    {
      "node_id": "UiIl6H2FQrqSUThr1ZQtRg",
      "node_name": "client-es-masters-hot-1",
      "transport_address": "172.16.195.84:9300",
      "node_attributes": {
        "ml.allocated_processors": "8",
        "ml.max_jvm_size": "4294967296",
        "xpack.installed": "true",
        "ml.machine_memory": "6442450944",
        "ml.allocated_processors_double": "8.0",
        "k8s_node_name": "ip-10-0-68-59.ap-southeast-2.compute.internal"
      },
      "node_decision": "yes",
      "weight_ranking": 4
    },
    {
      "node_id": "30V_RblrRSaYKprD4iJ5og",
      "node_name": "client-es-masters-hot-0",
      "transport_address": "172.16.10.137:9300",
      "node_attributes": {
        "ml.max_jvm_size": "4294967296",
        "xpack.installed": "true",
        "ml.machine_memory": "6442450944",
        "ml.allocated_processors_double": "8.0",
        "ml.allocated_processors": "8",
        "k8s_node_name": "ip-10-0-115-74.ap-southeast-2.compute.internal"
      },
      "node_decision": "yes",
      "weight_ranking": 5
    },
    {
      "node_id": "XfjyU7pGSVOFmjqKJAngyA",
      "node_name": "client-es-masters-hot-2",
      "transport_address": "172.16.235.221:9300",
      "node_attributes": {
        "xpack.installed": "true",
        "ml.machine_memory": "6442450944",
        "ml.allocated_processors_double": "8.0",
        "ml.max_jvm_size": "4294967296",
        "ml.allocated_processors": "8",
        "k8s_node_name": "ip-10-0-82-81.ap-southeast-2.compute.internal"
      },
      "node_decision": "yes",
      "weight_ranking": 6
    },
    {
      "node_id": "HnfWmpd3Ria9xfvH5Y88FQ",
      "node_name": "client-es-masters-warm-ebs-2",
      "transport_address": "172.16.175.150:9300",
      "node_attributes": {
        "k8s_node_name": "ip-10-0-64-101.ap-southeast-2.compute.internal",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_hot] and node does not meet the required [data_hot] tier"
        }
      ]
    },
    {
      "node_id": "sCHJV5mmQrCCcVAavWc-qg",
      "node_name": "client-es-masters-warm-ebs-0",
      "transport_address": "172.16.213.107:9300",
      "node_attributes": {
        "k8s_node_name": "ip-10-0-117-195.ap-southeast-2.compute.internal",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 2,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_hot] and node does not meet the required [data_hot] tier"
        }
      ]
    },
    {
      "node_id": "CdzzX7_5RvONdkjWU10T2w",
      "node_name": "client-es-masters-warm-ebs-1",
      "transport_address": "172.16.69.250:9300",
      "node_attributes": {
        "k8s_node_name": "ip-10-0-80-196.ap-southeast-2.compute.internal",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "data_tier",
          "decision": "NO",
          "explanation": "index has a preference for tiers [data_hot] and node does not meet the required [data_hot] tier"
        }
      ]
    }
  ]
}

Oh, hmm, that error is not helpful at all. I opened Fix NPE in Desired Balance API by DaveCTurner · Pull Request #97775 · elastic/elasticsearch · GitHub to fix it.

Is there anything in the logs to suggest that something related to allocation has unrecoverably failed? If you do something to disturb the allocation (e.g. create another new index) does it resolve the situation? Can you try DELETE /_internal/desired_balance? If none of that works, could you trigger a master failover (e.g. temporarily disconnect the current master)?
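Concretely, those suggestions amount to something like the sketch below (the index name is just a throwaway placeholder):

# 1. Disturb the allocator by creating a new, throwaway index
PUT /allocation-nudge-test

# 2. Reset the desired-balance computation
DELETE /_internal/desired_balance

# 3. If neither helps, force a master failover, e.g. by restarting or
#    temporarily disconnecting the pod hosting the current elected master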

I would not expect allocate_empty_primary to help, but I believe this index is brand-new and was never allocated, so it also won't lose anything.

Hi David,

There wasn't anything specific in the logs aside from periodic errors and stack traces about the null pointer exception.
Trying the DELETE as suggested didn't seem to do anything except return the same 500 error.
Attempting to create a new index likewise made no difference, except that its shards also went unallocated.

However, what did resolve the situation was killing one of the master pods in Kubernetes and letting it re-create. That got the desired_balance endpoint returning a valid response again, and then all of the previously unassigned shards were successfully routed and assigned by the cluster.
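For completeness, once the replacement master pod had rejoined, recovery was confirmed simply by re-running the earlier calls:

GET _internal/desired_balance

GET _cluster/health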

Thanks for your assistance with this. We can call this one solved.

Glad to hear you're up and running again. I'm not calling it "solved" so much as "worked around"; it's definitely a bug to get into this state in the first place. We'll continue trying to work out how that happened.
