Constant Unassigned Shards

Hello,

I have an Elasticsearch cluster with the following configuration:

3 Master Nodes (JVM Heap Ram 4 GB, 1 CPU, 1 GB Disk)
10 Data/Ingest Nodes (JVM Heap Ram 8 GB, 2 CPU, 500 GB Disk)

~200 Indexes
Each Index has 2 Replicas
~600 Shards
Each Data Node has ~60 Shards
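For context, the counts above are mutually consistent (treating ~200 indices as exactly 200, each with 1 primary and 2 replica shards); a quick sanity check:

```shell
# Shard arithmetic for the cluster described above:
# 200 indices x (1 primary + 2 replicas) = 600 shards,
# spread over 10 data nodes = ~60 shards per node.
echo $(( 200 * (1 + 2) ))        # total shards: 600
echo $(( 200 * (1 + 2) / 10 ))   # shards per data node: 60
```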

For the past month I have constantly been observing several (1-5) unassigned shards, which keeps the cluster in Yellow status.

The `_cluster/allocation/explain` endpoint shows the following:

{
  "index" : "prod-2021.06.09",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-06-09T11:13:10.137Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "3vFljpB3S6qRMivGz2wg1g",
      "node_name" : "-prod-es-data-6",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "4dRmDv9NQDCXucdrOrH9mw",
      "node_name" : "-prod-es-data-8",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "6_nnUSedQiaOKpN72xsxiQ",
      "node_name" : "-prod-es-data-1",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[prod-2021.06.09][0], node[6_nnUSedQiaOKpN72xsxiQ], [R], s[STARTED], a[id=B_6iBKySSF2U1PO-HESdNg]]"
        }
      ]
    },
    {
      "node_id" : "C2PBubtXQayDfNbV8SO3dA",
      "node_name" : "-prod-es-data-3",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "KNapgBKETTOpKIiaKcs18Q",
      "node_name" : "-prod-es-data-9",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[prod-2021.06.09][0], node[KNapgBKETTOpKIiaKcs18Q], [P], s[STARTED], a[id=FBCUV0sRSo6KIzxBv1nqAw]]"
        }
      ]
    },
    {
      "node_id" : "NewSPDMKShyAr_Aa8iPs2Q",
      "node_name" : "-prod-es-data-0",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "Om3F0aVgTaSL32ye7z477A",
      "node_name" : "-prod-es-data-5",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "cT_alrWFTEWXpZVlclCeRA",
      "node_name" : "-prod-es-data-7",
      "transport_address" : ":9300",
      "node_attributes" : {
        "ml.machine_memory" : "12884901888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-09T11:13:10.137Z], failed_attempts[5], delayed=false, details[failed shard on node [6_nnUSedQiaOKpN72xsxiQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[prod-2021.06.09][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}
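Every `max_retry` decider in the output above points at the same immediate workaround: the shard has exhausted its 5 allocation attempts, so once the shard-lock timeouts have passed, the failed allocations have to be retried manually (localhost:9200 is a placeholder for your cluster endpoint):

```shell
# Reset the retry counter and re-attempt allocation of shards that
# failed their maximum number of allocation attempts
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"

# Then watch for shards that are still not STARTED
curl -s "localhost:9200/_cat/shards?v" | grep -v STARTED
```

Note that this only clears the retry counter; if the ShardLockObtainFailedException keeps recurring, the underlying cause (slow shard creation/recovery on the node still holding the lock) is not fixed by the retry.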

During the day, on both the master and data nodes, JVM heap usage spikes to ~90-100% once or several times per hour, which also leads to unstable cluster behaviour.

Such exceptions also appear in Kibana monitoring:

[parent] Data too large, data for [<http_request>] would be [8207642762/7.6gb], which is larger than the limit of [8143876915/7.5gb], real usage: [8207642456/7.6gb], new bytes reserved: [306/306b], usages [request=72/72b, fielddata=19223/18.7kb, in_flight_requests=70836/69.1kb, accounting=70341646/67mb]: [circuit_breaking_exception] [parent] Data too large, data for [<http_request>] would be [8207642762/7.6gb], which is larger than the limit of [8143876915/7.5gb], real usage: [8207642456/7.6gb], new bytes reserved: [306/306b], usages [request=72/72b, fielddata=19223/18.7kb, in_flight_requests=70836/69.1kb, accounting=70341646/67mb], with { bytes_wanted=8207642762 & bytes_limit=8143876915 & durability="PERMANENT" }: Check the Elasticsearch Monitoring cluster network connection or the load level of the nodes. 
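The circuit_breaking_exception above is the parent circuit breaker tripping. Assuming the 7.x default of `indices.breaker.total.limit: 95%` of heap (an assumption — check your cluster settings), the reported 7.5gb limit implies a data-node heap of roughly 8 GB, and the rejected request was only ~60 MiB over the limit:

```shell
# Numbers copied from the circuit_breaking_exception message above
bytes_limit=8143876915    # "limit of [8143876915/7.5gb]"
bytes_wanted=8207642762   # "would be [8207642762/7.6gb]"

# Implied max heap if the limit is 95% of heap (7.x parent-breaker default):
# prints 8572502015, i.e. just under 8 GiB
echo $(( bytes_limit * 100 / 95 ))

# How far over the limit the request was, in MiB: prints 60
echo $(( (bytes_wanted - bytes_limit) / 1048576 ))
```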

Could you please suggest

  • What could cause such constant shard unassignments?
  • How could the number of indexes, shards, and replicas, the number of master and data nodes, and the RAM size influence this behaviour?
  • What could cause the constant spikes in JVM heap usage during the night, considering that the load then is much lower and steadier than during the day?
  • What additional information can I provide?

Thank you, Sasha

Which version of Elasticsearch are you using?

What type of hardware is this cluster deployed on? What type of storage are you using?

What is the total data volume in the cluster?

What is the full output of the cluster stats API?

Which version of Elasticsearch are you using?
Version 7.4.2

What type of hardware is this cluster deployed on? What type of storage are you using?
The cluster is deployed on Azure Kubernetes Service, version 1.19.7
Node pool with 8 nodes of the Standard_B8ms size

|Size|vCPU|Memory: GiB|Temp storage (SSD) GiB|Base CPU Perf of VM|Max CPU Perf of VM|Initial Credits|Credits banked/hour|Max Banked Credits|Max data disks|Max cached and temp storage throughput: IOPS/MBps|Max uncached disk throughput: IOPS/MBps|Max NICs|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Standard_B1ls1|1|0.5|4|5%|100%|30|3|72|2|200/10|160/10|2|
|Standard_B1s|1|1|4|10%|100%|30|6|144|2|400/10|320/10|2|
|Standard_B1ms|1|2|4|20%|100%|30|12|288|2|800/10|640/10|2|
|Standard_B2s|2|4|8|40%|200%|60|24|576|4|1600/15|1280/15|3|
|Standard_B2ms|2|8|16|60%|200%|60|36|864|4|2400/22.5|1920/22.5|3|
|Standard_B4ms|4|16|32|90%|400%|120|54|1296|8|3600/35|2880/35|4|
|Standard_B8ms|8|32|64|135%|800%|240|81|1944|16|4320/50|4320/50|4|
|Standard_B12ms|12|48|96|202%|1200%|360|121|2909|16|6480/75|4320/50|6|
|Standard_B16ms|16|64|128|270%|1600%|480|162|3888|32|8640/100|4320/50|8|
|Standard_B20ms|20|80|160|337%|2000%|600|203|4860|32|10800/125|4320/50|8|

Master Nodes - Storage class - default - 1 GB
Data Nodes - Storage class - managed-premium - 500 GB

What is the total data volume in the cluster?

shards disk.indices disk.used disk.avail disk.total disk.percent node
    59      292.7gb   292.8gb    198.2gb    491.1gb           59 -prod-es-data-9
    59      282.1gb   282.2gb    208.8gb    491.1gb           57 -prod-es-data-8
    60      339.5gb   343.4gb    148.6gb      492gb           69 -prod-es-data-6
    58      158.3gb   166.1gb    325.9gb      492gb           33 -prod-es-data-3
    58      217.3gb   225.6gb    266.3gb      492gb           45 -prod-es-data-4
    58      236.8gb   280.4gb    211.5gb      492gb           57 -prod-es-data-2
    58      339.5gb   341.8gb    150.1gb      492gb           69 -prod-es-data-0
    60      377.5gb   411.7gb     80.3gb      492gb           83 -prod-es-data-1
    58      375.8gb     380gb    111.9gb      492gb           77 -prod-es-data-7
    57      376.1gb   378.5gb    113.4gb      492gb           76 -prod-es-data-5

What is the full output of the cluster stats API?

{
  "_nodes": {
    "total": 13,
    "successful": 13,
    "failed": 0
  },
  "cluster_name": "-prod",
  "cluster_uuid": "i9NAe_QzRHucEpZYZryC2g",
  "timestamp": 1623266658140,
  "status": "yellow",
  "indices": {
    "count": 202,
    "shards": {
      "total": 586,
      "primaries": 202,
      "replication": 1.900990099009901,
      "index": {
        "shards": {
          "min": 1,
          "max": 3,
          "avg": 2.900990099009901
        },
        "primaries": {
          "min": 1,
          "max": 1,
          "avg": 1
        },
        "replication": {
          "min": 0,
          "max": 2,
          "avg": 1.900990099009901
        }
      }
    },
    "docs": {
      "count": 370312260,
      "deleted": 3947462
    },
    "store": {
      "size_in_bytes": 3189034450494
    },
    "fielddata": {
      "memory_size_in_bytes": 97432,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 279982219,
      "total_count": 55154087,
      "hit_count": 8434200,
      "miss_count": 46719887,
      "cache_size": 138866,
      "cache_count": 176913,
      "evictions": 38047
    },
    "completion": {
      "size_in_bytes": 17742243209
    },
    "segments": {
      "count": 11402,
      "memory_in_bytes": 18412282097,
      "terms_memory_in_bytes": 18155773137,
      "stored_fields_memory_in_bytes": 195121032,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 5587840,
      "points_memory_in_bytes": 25326656,
      "doc_values_memory_in_bytes": 30473432,
      "index_writer_memory_in_bytes": 10638712,
      "version_map_memory_in_bytes": 7469,
      "fixed_bit_set_memory_in_bytes": 4961328,
      "max_unsafe_auto_id_timestamp": 1623266399107,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 13,
      "coordinating_only": 0,
      "data": 10,
      "ingest": 10,
      "master": 3,
      "ml": 13,
      "voting_only": 0
    },
    "versions": [
      "7.4.2"
    ],
    "os": {
      "available_processors": 26,
      "allocated_processors": 26,
      "names": [
        {
          "name": "Linux",
          "count": 13
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "CentOS Linux 7 (Core)",
          "count": 13
        }
      ],
      "mem": {
        "total_in_bytes": 437795311616,
        "free_in_bytes": 103526707200,
        "used_in_bytes": 334268604416,
        "free_percent": 24,
        "used_percent": 76
      }
    },
    "process": {
      "cpu": {
        "percent": 6
      },
      "open_file_descriptors": {
        "min": 518,
        "max": 4618,
        "avg": 2489
      }
    },
    "jvm": {
      "max_uptime_in_millis": 1176357833,
      "versions": [
        {
          "version": "13.0.1",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "13.0.1+9",
          "vm_vendor": "AdoptOpenJDK",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 13
        }
      ],
      "mem": {
        "heap_used_in_bytes": 59687641696,
        "heap_max_in_bytes": 98557624320
      },
      "threads": 929
    },
    "fs": {
      "total_in_bytes": 5284315144192,
      "free_in_bytes": 1954137677824,
      "available_in_bytes": 1953919574016
    },
    "plugins": [
      {
        "name": "ingest-attachment",
        "version": "7.4.2",
        "elasticsearch_version": "7.4.2",
        "java_version": "1.8",
        "description": "Ingest processor that uses Apache Tika to extract contents",
        "classname": "org.elasticsearch.ingest.attachment.IngestAttachmentPlugin",
        "extended_plugins": [],
        "has_native_controller": false
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 13
      },
      "http_types": {
        "security4": 13
      }
    },
    "discovery_types": {
      "zen": 13
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "docker",
        "count": 13
      }
    ]
  }
}

Are there any errors or signs of instability or leader elections in the Elasticsearch logs?

How can I find them? :slight_smile: Can I see them in Kibana?

If you are using Kubernetes you need to get them from the pods.
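A sketch of pulling them from the pods (the namespace and pod names are placeholders — adjust to your deployment):

```shell
# List the Elasticsearch pods (replace <namespace> with yours)
kubectl get pods -n <namespace>

# Dump one node's log and look for elections, shard failures, and breaker trips
kubectl logs -n <namespace> <es-data-pod> | grep -iE "elect|master|failed|CircuitBreaking"
```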

That's EOL, please upgrade ASAP.

Please also don't post pictures of text or code. They are difficult to read, impossible to search and replicate (if it's code), and some people may not even be able to see them :slight_smile:

Replaced the pictures with code sections, so they are more readable and searchable.