Primary Shard ALLOCATION_FAILED

Hello,
Hope everyone is doing great.

We had a power outage and our ES server went down. When the server came back online I noticed the ES cluster status was red:

"cluster_name" : "elasticsearch",
 "status" : "red",
 "timed_out" : false,
 "number_of_nodes" : 1,
 "number_of_data_nodes" : 1,
 "active_primary_shards" : 5,
 "active_shards" : 5,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 26,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0,
 "task_max_waiting_in_queue_millis" : 0,
 "active_shards_percent_as_number" : 16.129032258064516

I checked the shard status:

index            shard prirep state      node        unassigned.reason
index_expedients 1     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 3     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 4     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 2     p      UNASSIGNED             ALLOCATION_FAILED
index_expedients 2     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 0     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  1     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  1     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  3     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  3     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  2     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  2     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  4     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  4     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  0     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  0     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    1     p      UNASSIGNED             ALLOCATION_FAILED
index_general    1     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    3     p      UNASSIGNED             ALLOCATION_FAILED
index_general    3     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    2     p      UNASSIGNED             ALLOCATION_FAILED
index_general    2     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    4     p      UNASSIGNED             ALLOCATION_FAILED
index_general    4     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    0     p      UNASSIGNED             ALLOCATION_FAILED
index_general    0     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 1     p      STARTED    SPANBIWEB02
index_expedients 3     p      STARTED    SPANBIWEB02
index_expedients 4     p      STARTED    SPANBIWEB02
index_expedients 0     p      STARTED    SPANBIWEB02
.kibana          0     p      STARTED    SPANBIWEB02
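
That table comes from the _cat/shards API; a request along these lines (the column list simply matches what's shown above, host/port again assumed) reproduces it:

curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason'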

I'm getting this error for the primary shards with failed allocation:

{
  "index": "index_expedients",
  "shard": 2,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2022-09-23T15:32:36.195Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [hfFii1P3QxKPvijusWAExA]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: EOFException[read past EOF. pos [1973] length: [4] end: [1973]]; ",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions": [
    {
      "node_id": "hfFii1P3QxKPvijusWAExA",
      "node_name": "SPANBIWEB02",
      "transport_address": "127.0.0.1:9300",
      "node_attributes": {
        "ml.machine_memory": "17179398144",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "G30Aru58QhWc7kY2hrNecw"
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2022-09-23T15:32:36.195Z], failed_attempts[5], delayed=false, details[failed shard on node [hfFii1P3QxKPvijusWAExA]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: EOFException[read past EOF. pos [1973] length: [4] end: [1973]]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
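
That detail is from the cluster allocation explain API, called with the index and shard from the table above (host/port assumed):

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '{
  "index": "index_expedients",
  "shard": 2,
  "primary": true
}'

The retry that the max_retry decider mentions would be the call below, though presumably it will just keep failing while the translog on disk stays truncated:

curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'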

Unfortunately I don't have a snapshot to restore from. Is there a way to recover these shards?

Regards,

This error means that your disks were behaving incorrectly: they confirmed to Elasticsearch that some data had been written durably before it really had been, so that data was lost in the power outage. This is a pretty common trick on cheaper hardware since it makes the disks look much faster than they actually are, although there's sometimes a way to configure them to behave properly. The reference manual has some notes on troubleshooting disk issues - in particular, diskchecker.pl should be able to reproduce the same problem, which will help you explore the possible fixes.
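
In case it's useful, the diskchecker.pl procedure is roughly as follows (hostnames, paths and the file size are placeholders); the important part is cutting the power to the machine under test while the create step is running, then verifying once it's back up:

# on a second machine that stays powered on:
./diskchecker.pl -l

# on the machine whose disk you want to test, writing to that disk:
./diskchecker.pl -s other-host create /path/on/tested/disk/test_file 500

# pull the power on the test machine mid-run, boot it back up, then:
./diskchecker.pl -s other-host verify /path/on/tested/disk/test_file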

Unfortunately it means that the data Elasticsearch wrote simply never made it to the disk; it's gone, and there's no way to recover the missing portions.

The best remedy is to restore from a snapshot. If your data matters to you, you really should be taking snapshots!
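
For completeness, a minimal snapshot setup on 6.x is just two API calls; the repository name and location below are made up, and the location must be listed under path.repo in elasticsearch.yml:

# register a shared-filesystem snapshot repository
curl -s -X PUT -H 'Content-Type: application/json' 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'

# take a snapshot of all indices and wait for it to finish
curl -s -X PUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'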

You may be able to use the elasticsearch-shard tool to recover some of the remaining data, although there's no guarantee how much might be lost in the process.
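
On versions that include the tool, the sequence is roughly this (index and shard taken from your output above; the exact reroute command to run afterwards is printed by the tool itself). Stop the node, then on that node run:

bin/elasticsearch-shard remove-corrupted-data --index index_expedients --shard-id 2

After restarting the node, tell the cluster to accept the (possibly lossy) surviving copy with something like:

curl -s -X POST -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [ {
    "allocate_stale_primary": {
      "index": "index_expedients",
      "shard": 2,
      "node": "SPANBIWEB02",
      "accept_data_loss": true
    }
  } ]
}'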

Thank you David. I'm not able to find the elasticsearch-shard tool; I'm using version 6.3.0.
@DavidTurner Do you know if there's a way to get the elasticsearch-shard tool for version 6.3.0?

No, sorry, I don't remember if there's a way to recover anything from this situation in such old versions.

6.3 is super old and EOL; you should really look to upgrade ASAP.
