Shard failures - recovery possible?

Hi,

Unfortunately, there was a problem on my cluster after a power outage.

The cluster has RED status - a few shards won't come back up.

I've tried using reroute:
POST /_cluster/reroute?retry_failed=true

but it's not working.
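(For reference, a quick way to list which shards are unassigned and why - assuming the unassigned.reason column available in recent versions of the cat shards API - is:)

GET /_cat/shards/my-index-xxx?v&h=index,shard,prirep,state,unassigned.reason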

If I run this query:

GET /_cluster/allocation/explain
{
  "index": "my-index-xxx",
  "shard": 0,
  "primary": true
}

The cluster returns:

{
  "index" : "my-index-xxx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-05-09T10:19:09.285Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [4oMUIeiASHeq8ZYNIx5hUg]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/var/lib/elasticsearch/nodes/0/indices/NiCEzOZjS9ib2oG2AC3QXg/0/translog/translog-134.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [1718122] length: [4] end: [1718122]]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "4oMUIeiASHeq8ZYNIx5hUg",
      "node_name" : "data-006",
      "transport_address" : "192.168.88.40:9300",
      "node_attributes" : {
        "machine_id" : "M001",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "f7brAMpUSgqLn1ERtP5teg"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-05-09T10:19:09.285Z], failed_attempts[5], failed_nodes[[4oMUIeiASHeq8ZYNIx5hUg]], delayed=false, details[failed shard on node [4oMUIeiASHeq8ZYNIx5hUg]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/var/lib/elasticsearch/nodes/0/indices/NiCEzOZjS9ib2oG2AC3QXg/0/translog/translog-134.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [1718122] length: [4] end: [1718122]]; ], allocation_status[deciders_no]]]"
        }
      ]
    },
...

Any ideas on how to recover the shard?

This probably relates to your other recent post:

Looks like your storage system does not, in fact, implement fsync() correctly.

What's the whole response from the allocation explain API? It looks like all copies of this shard are broken. Does this index have replicas?
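If it helps, one way to check the replica count is the index settings API (the filter_path parameter just trims the response):

GET /my-index-xxx/_settings?filter_path=*.settings.index.number_of_replicas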

Unfortunately, yes - this index had two replicas.

Each Elasticsearch node has its own hard drive.

During the power outage, all of the servers lost power.

Despite the two replicas (counting the primary, three different physical disks), the index has RED status.

Is something like that even possible?

Repeating the allocation explain request, here is the full output:

{
  "index" : "my-index-xxx",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-05-09T11:51:49.938Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [4oMUIeiASHeq8ZYNIx5hUg]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/var/lib/elasticsearch/nodes/0/indices/NiCEzOZjS9ib2oG2AC3QXg/0/translog/translog-134.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [1718122] length: [4] end: [1718122]]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "4oMUIeiASHeq8ZYNIx5hUg",
      "node_name" : "data-006",
      "transport_address" : "192.168.88.40:9300",
      "node_attributes" : {
        "machine_id" : "M001",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "f7brAMpUSgqLn1ERtP5teg"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-05-09T11:51:49.938Z], failed_attempts[5], failed_nodes[[4oMUIeiASHeq8ZYNIx5hUg]], delayed=false, details[failed shard on node [4oMUIeiASHeq8ZYNIx5hUg]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/var/lib/elasticsearch/nodes/0/indices/NiCEzOZjS9ib2oG2AC3QXg/0/translog/translog-134.tlog] is corrupted, translog truncated]; nested: EOFException[read past EOF. pos [1718122] length: [4] end: [1718122]]; ], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id" : "5s8OBqvOSjOMrbwKgf0UVg",
      "node_name" : "data-005",
      "transport_address" : "192.168.88.39:9300",
      "node_attributes" : {
        "machine_id" : "M001",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "AisEn6iUT3q7PG08eVGIdg",
      "node_name" : "data-000",
      "transport_address" : "192.168.88.34:9300",
      "node_attributes" : {
        "machine_id" : "M000",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "tnMCOvcQTbOs40MRsQC-Hw"
      }
    },
    {
      "node_id" : "LAlFofPkRXCTb6AQzPfenQ",
      "node_name" : "data-002",
      "transport_address" : "192.168.88.36:9300",
      "node_attributes" : {
        "machine_id" : "M000",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "OAg_piYbQ5S2Lnyy582erg",
      "node_name" : "data-001",
      "transport_address" : "192.168.88.35:9300",
      "node_attributes" : {
        "machine_id" : "M000",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "L-sBscKGTjWe8EnFFNgxxg"
      }
    },
    {
      "node_id" : "gIZy3SQVSVCn6nMgMr51Bg",
      "node_name" : "data-008",
      "transport_address" : "192.168.88.42:9300",
      "node_attributes" : {
        "machine_id" : "M001",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "i45HkwMSTYGCyfwnJ1KFAg",
      "node_name" : "data-004",
      "transport_address" : "192.168.88.38:9300",
      "node_attributes" : {
        "machine_id" : "M000",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "jZyB7kNeS_WGI_Q6HgYj3A",
      "node_name" : "data-007",
      "transport_address" : "192.168.88.41:9300",
      "node_attributes" : {
        "machine_id" : "M001",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "loQtfthqRvqWNadLxAr5zA",
      "node_name" : "data-003",
      "transport_address" : "192.168.88.37:9300",
      "node_attributes" : {
        "machine_id" : "M000",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}

Ok, the other two copies of this shard were on nodes data-000 and data-001, but those copies are stale (i.e. they do not contain all acknowledged writes), possibly because those nodes failed first. Unfortunately that means there's no way to safely recover this data from within Elasticsearch.

I'd recommend deleting this index, restoring it from a recent snapshot, and then repeating any indexing that took place since the snapshot was taken. Alternatively you could perhaps recreate it from the source data assuming you've still got that.
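Roughly, the steps would look like this - the repository name my_backup_repo and the snapshot name are placeholders, so adjust them to your setup (GET /_snapshot/my_backup_repo/_all lists the snapshots you have):

# placeholder repository and snapshot names - pick your most recent snapshot
DELETE /my-index-xxx

POST /_snapshot/my_backup_repo/snapshot_2020_05_08/_restore
{
  "indices": "my-index-xxx"
}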

That's what I'll do.
Thanks a lot!
