Shard Allocation Failures After 5 Retries


We have a 23-node cluster with 5 master nodes, 3 coordinating nodes, and 15 data nodes. Our index has 30 primary shards and 3 replicas, and is around 800 GB in size. Earlier this week, after rebooting one of the nodes, we found 1 unassigned shard that failed to get allocated. Here is the response from the allocation explain API:

  {
    "index" : "index_name",
    "shard" : 11,
    "primary" : false,
    "current_state" : "unassigned",
    "unassigned_info" : {
      "reason" : "ALLOCATION_FAILED",
      "at" : "2021-06-22T06:56:10.775Z",
      "failed_allocation_attempts" : 5,
      "details" : "failed shard on node [YQU4hZwQQVifqzeCJ4G0Dw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_name][11]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ",
      "last_allocation_status" : "no_attempt"
    },
    "can_allocate" : "no",
    "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
    "node_allocation_decisions" : [
      {
        "node_id" : "2eM6R8OPQDek7BVJ7w72XA",
        "node_name" : "esd01",
        "transport_address" : "x.x.x.235:9300",
        "node_attributes" : {
          "xpack.installed" : "true"
        },
        "node_decision" : "no",
        "deciders" : [
          {
            "decider" : "max_retry",
            "decision" : "NO",
            "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-22T06:56:10.775Z], failed_attempts[5], failed_nodes[[YQU4hZwQQVifqzeCJ4G0Dw]], delayed=false, details[failed shard on node [YQU4hZwQQVifqzeCJ4G0Dw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_name][11]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
          }
        ]
      },
      {
        "node_id" : "ySGiZ52BQwG8tGWJ4pcayA",
        "node_name" : "esd14",
        "transport_address" : "x.x.x.248:9300",
        "node_attributes" : {
          "xpack.installed" : "true"
        },
        "node_decision" : "no",
        "deciders" : [
          {
            "decider" : "max_retry",
            "decision" : "NO",
            "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-06-22T06:56:10.775Z], failed_attempts[5], failed_nodes[[YQU4hZwQQVifqzeCJ4G0Dw]], delayed=false, details[failed shard on node [YQU4hZwQQVifqzeCJ4G0Dw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[index_name][11]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]]; ], allocation_status[no_attempt]]]"
          },
          {
            "decider" : "same_shard",
            "decision" : "NO",
            "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[index_name][11], node[ySGiZ52BQwG8tGWJ4pcayA], [R], s[STARTED], a[id=SmMfRLraRiS6Pfvy2IdxLA]]"
          }
        ]
      }
    ]
  }

We have also found many exceptions like this one on our Elasticsearch data nodes:

[2021-06-22T07:53:33,766][WARN ][o.e.c.a.s.ShardStateAction] [esd12] unexpected failure while sending request [internal:cluster/shard/failure] to [{esm01}{0NjBrIyQRc65BUfwzfjGow}{gs7hdoLFSY-eUQktHbIdBQ}{x.x.x.212}{x.x.x.212:9300}{m}{xpack.installed=true}] for shard entry [shard id [[index_name][4]], allocation id [dsxBh1jPSiaTvzZPRf24zA], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [index_name][4], node[OZb-Z6SQQnuEf6Djk4j-5w], [R], s[STARTED], a[id=dsxBh1jPSiaTvzZPRf24zA]], failure [RemoteTransportException[[esd09][x.x.x.243:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[index_name][4] operation primary term [1] is too old (current [2])]; ], markAsStale [true]]
    org.elasticsearch.transport.RemoteTransportException: [esm01][x.x.x.212:9300][internal:cluster/shard/failure]
    Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [1] did not match current primary term [2]
            at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute( ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.cluster.service.MasterService.executeTasks( ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs( ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.cluster.service.MasterService.runTasks( ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.cluster.service.MasterService.access$000( ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.cluster.service.MasterService$ ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed( ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.cluster.service.TaskBatcher$ ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.common.util.concurrent.ThreadContext$ ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean( ~[elasticsearch-7.5.0.jar:7.5.0]
            at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$ ~[elasticsearch-7.5.0.jar:7.5.0]
            at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[?:?]
            at java.util.concurrent.ThreadPoolExecutor$ ~[?:?]
            at [?:?]

We would appreciate any input on possible causes.

The first problem can happen if the rebooted node was the master. It has been fixed in 7.13.0:
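Independently of the fix, the `max_retry` explanation in the output above names the manual retry call, `/_cluster/reroute?retry_failed=true`. A minimal Python sketch of issuing that call is below. The base URL `http://localhost:9200`, the absence of authentication and TLS, and the helper names `build_retry_request` and `retry_failed_allocations` are all assumptions for illustration, not something stated in this thread.

```python
import json
import urllib.request


def build_retry_request(base_url: str) -> urllib.request.Request:
    """Build the POST request named in the max_retry explanation."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    # No request body is sent; the retry is requested via the query string.
    return urllib.request.Request(url, method="POST")


def retry_failed_allocations(base_url: str) -> dict:
    """Send the retry request and return the parsed JSON response."""
    request = build_retry_request(base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumption: an unsecured cluster reachable on localhost:9200.
    # Only the request is printed here; nothing is sent to the cluster.
    req = build_retry_request("http://localhost:9200")
    print(req.get_method(), req.full_url)
```

Calling `retry_failed_allocations(...)` only resets the retry counter and asks the master to try again; if the shard lock is still held, the allocation can fail again, so it is worth re-running the allocation explain API afterwards.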

Is the second issue related to the first, i.e. the same index and shard, or is it something completely separate? Is it all for one index/shard, and how many times did it occur? This can happen in edge cases, and multiple replicas increase the likelihood somewhat, though I would find it odd if it happened frequently.
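One way to answer the "which shard, and how often" question is to count the "primary term ... is too old" failures per index and shard in the data node logs. The sketch below does that with a regular expression matched against the message format shown in the log excerpt above; the function name `count_stale_term_failures` is hypothetical, and it assumes each log event sits on a single line.

```python
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "[index_name][4] operation primary term [1] is too old (current [2])"
STALE_TERM = re.compile(
    r"\[(?P<index>[^\[\]]+)\]\[(?P<shard>\d+)\] operation primary term "
    r"\[(?P<old>\d+)\] is too old \(current \[(?P<current>\d+)\]\)"
)


def count_stale_term_failures(lines: Iterable[str]) -> Counter:
    """Count stale-primary-term failures, keyed by (index, shard number)."""
    counts: Counter = Counter()
    for line in lines:
        for match in STALE_TERM.finditer(line):
            counts[(match.group("index"), int(match.group("shard")))] += 1
    return counts


if __name__ == "__main__":
    # Sample fragment taken from the warning quoted in this thread.
    sample = [
        "nested: IllegalStateException[[index_name][4] operation primary "
        "term [1] is too old (current [2])]; ], markAsStale [true]]",
    ]
    for (index, shard), n in count_stale_term_failures(sample).most_common():
        print(f"{index}[{shard}]: {n}")
```

In practice the input would be the lines of a log file, for example `count_stale_term_failures(open(path))`, where `path` is wherever the node writes its logs.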

We have only 1 index, spread across 15 data nodes, with 30 primary shards and 3 replicas. This issue has happened only once in the past 6 months.
