Unassigned Shard

Hi All,
A while ago I reported this issue, but back then I was running a single-node cluster and thought that might have been the cause. Yesterday I migrated to a three-node cluster configuration:
3 Elasticsearch nodes (master and data nodes)
1 Logstash
1 Kibana
5 Filebeat instances
Everything is running version 7.0.0.
Yesterday, after migrating the cluster, everything went well.
This morning I was faced with a red cluster. The issue is an unassigned shard for the newest index; after hitting the allocation explain endpoint I got this:

{
  "index": "logstash-000051",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2019-12-06T12:34:31.577Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [25AFdrnqTG6lRsDR4CxZpQ]: failed recovery, failure RecoveryFailedException[[logstash-000051][0]: Recovery failed on {elasticsearch2}{25AFdrnqTG6lRsDR4CxZpQ}{mPx8umy9QIWE7hR8vdNu3w}{44.128.0.11}{44.128.0.11:9301}{ml.machine_memory=6442450944, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/50zRNBhxT3qdJpUdS4B47Q/0/index/_1a4.fdt]; ",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions": [
    {
      "node_id": "25AFdrnqTG6lRsDR4CxZpQ",
      "node_name": "elasticsearch2",
      "transport_address": "44.128.0.11:9301",
      "node_attributes": {
        "ml.machine_memory": "6442450944",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "fhFsHVG5RsW_kMxAYfYAcg"
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-12-06T12:34:31.577Z], failed_attempts[5], delayed=false, details[failed shard on node [25AFdrnqTG6lRsDR4CxZpQ]: failed recovery, failure RecoveryFailedException[[logstash-000051][0]: Recovery failed on {elasticsearch2}{25AFdrnqTG6lRsDR4CxZpQ}{mPx8umy9QIWE7hR8vdNu3w}{44.128.0.11}{44.128.0.11:9301}{ml.machine_memory=6442450944, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/50zRNBhxT3qdJpUdS4B47Q/0/index/_1a4.fdt]; ], allocation_status[deciders_no]]]"
        },
        {
          "decider": "replica_after_primary_active",
          "decision": "YES",
          "explanation": "shard is primary and can be allocated"
        },
        {
          "decider": "enable",
          "decision": "YES",
          "explanation": "all allocations are allowed"
        },
        {
          "decider": "node_version",
          "decision": "YES",
          "explanation": "the primary shard is new or already existed on the node"
        },
        {
          "decider": "snapshot_in_progress",
          "decision": "YES",
          "explanation": "no snapshots are currently running"
        },
        {
          "decider": "restore_in_progress",
          "decision": "YES",
          "explanation": "ignored as shard is not being recovered from a snapshot"
        },
        {
          "decider": "filter",
          "decision": "YES",
          "explanation": "node passes include/exclude/require filters"
        },
        {
          "decider": "same_shard",
          "decision": "YES",
          "explanation": "the shard does not exist on the same node"
        },
        {
          "decider": "disk_threshold",
          "decision": "YES",
          "explanation": "enough disk for shard on node, free: [4.6tb], shard size: [0b], free after allocating shard: [4.6tb]"
        },
        {
          "decider": "throttling",
          "decision": "YES",
          "explanation": "below primary recovery limit of [4]"
        },
        {
          "decider": "shards_limit",
          "decision": "YES",
          "explanation": "total shard limits are disabled: [index: -1, cluster: -1] <= 0"
        },
        {
          "decider": "awareness",
          "decision": "YES",
          "explanation": "allocation awareness is not enabled, set cluster setting [cluster.routing.allocation.awareness.attributes] to enable it"
        }
      ]
    },
    {
      "node_id": "f4qB5RJ-QN-46B44ZRLcrQ",
      "node_name": "elasticsearch3",
      "transport_address": "44.128.0.11:9302",
      "node_attributes": {
        "ml.machine_memory": "6442450944",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "gjegRSM1Rbi22HYJOFYINw",
      "node_name": "elasticsearch1",
      "transport_address": "44.128.0.11:9300",
      "node_attributes": {
        "ml.machine_memory": "6442450944",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
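
For reference, the output above came from the allocation explain API, and the max_retry decider points at the retry endpoint; the requests look roughly like this (localhost:9200 is just an assumption for a local node, substitute your own host and port):

# allocation explain request for the problem shard; adjust host/port for your nodes
curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "logstash-000051",
  "shard": 0,
  "primary": true
}
'

# the retry call the max_retry decider mentions (only worth running once the underlying problem is fixed)
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true&pretty"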

The original node was named elasticsearch1; I added two new nodes, elasticsearch2 and elasticsearch3. They all share the same file system, but each has its own data folder.

Any ideas?

This shard is missing a vital file and cannot be recovered. You will need to work out why this file is missing (it was not removed by Elasticsearch) and stop it from happening again.

I see a previous post from you about the same sort of issue here:

Did you try the same diagnosis steps as before? That is, does ls /usr/share/elasticsearch/data/nodes/0/indices/50zRNBhxT3qdJpUdS4B47Q/0/index/ show a file called _1a4.fdt while ls /usr/share/elasticsearch/data/nodes/0/indices/50zRNBhxT3qdJpUdS4B47Q/0/index/_1a4.fdt reports No such file or directory? If so, there's something fundamentally wrong with your filesystem again.
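
Concretely, with the paths from your explain output, that check is just:

# run these on the node that reported the failure (elasticsearch2)
ls /usr/share/elasticsearch/data/nodes/0/indices/50zRNBhxT3qdJpUdS4B47Q/0/index/ | grep _1a4
ls /usr/share/elasticsearch/data/nodes/0/indices/50zRNBhxT3qdJpUdS4B47Q/0/index/_1a4.fdt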

I tried; this time the file was missing. Then the cluster was restarted, and when it came back the index was working like nothing had happened. There is something very fishy here. I will continue trying to diagnose the filesystem.
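
In the meantime, to keep an eye on whether the shard stays assigned, a couple of quick checks (again assuming a node listening on localhost:9200):

# shard-level view of the index that went red
curl -X GET "localhost:9200/_cat/shards/logstash-000051?v"
# overall cluster status
curl -X GET "localhost:9200/_cluster/health?pretty"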
