Primary Shard ALLOCATION_FAILED

Hello,
Hope everyone is doing great.

We had a power outage and our ES server went down. When the server came back online I noticed the ES cluster status was red:

"cluster_name" : "elasticsearch",
 "status" : "red",
 "timed_out" : false,
 "number_of_nodes" : 1,
 "number_of_data_nodes" : 1,
 "active_primary_shards" : 5,
 "active_shards" : 5,
 "relocating_shards" : 0,
 "initializing_shards" : 0,
 "unassigned_shards" : 26,
 "delayed_unassigned_shards" : 0,
 "number_of_pending_tasks" : 0,
 "number_of_in_flight_fetch" : 0,
 "task_max_waiting_in_queue_millis" : 0,
 "active_shards_percent_as_number" : 16.129032258064516

I checked the shard status:

index            shard prirep state      node        unassigned.reason
index_expedients 1     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 3     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 4     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 2     p      UNASSIGNED             ALLOCATION_FAILED
index_expedients 2     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 0     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  1     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  1     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  3     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  3     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  2     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  2     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  4     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  4     r      UNASSIGNED             CLUSTER_RECOVERED
index_customers  0     p      UNASSIGNED             ALLOCATION_FAILED
index_customers  0     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    1     p      UNASSIGNED             ALLOCATION_FAILED
index_general    1     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    3     p      UNASSIGNED             ALLOCATION_FAILED
index_general    3     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    2     p      UNASSIGNED             ALLOCATION_FAILED
index_general    2     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    4     p      UNASSIGNED             ALLOCATION_FAILED
index_general    4     r      UNASSIGNED             CLUSTER_RECOVERED
index_general    0     p      UNASSIGNED             ALLOCATION_FAILED
index_general    0     r      UNASSIGNED             CLUSTER_RECOVERED
index_expedients 1     p      STARTED    SPANBIWEB02
index_expedients 3     p      STARTED    SPANBIWEB02
index_expedients 4     p      STARTED    SPANBIWEB02
index_expedients 0     p      STARTED    SPANBIWEB02
.kibana          0     p      STARTED    SPANBIWEB02
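
That table comes from the _cat/shards API; a request along these lines (the column list simply matches what's shown above, host/port again assumed) reproduces it:

curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason'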

I'm getting this error for the primary shards with failed allocation:

{
  "index": "index_expedients",
  "shard": 2,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2022-09-23T15:32:36.195Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [hfFii1P3QxKPvijusWAExA]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: EOFException[read past EOF. pos [1973] length: [4] end: [1973]]; ",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions": [
    {
      "node_id": "hfFii1P3QxKPvijusWAExA",
      "node_name": "SPANBIWEB02",
      "transport_address": "127.0.0.1:9300",
      "node_attributes": {
        "ml.machine_memory": "17179398144",
        "xpack.installed": "true",
        "ml.max_open_jobs": "20",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "G30Aru58QhWc7kY2hrNecw"
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2022-09-23T15:32:36.195Z], failed_attempts[5], delayed=false, details[failed shard on node [hfFii1P3QxKPvijusWAExA]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: EOFException[read past EOF. pos [1973] length: [4] end: [1973]]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
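
That detail is from the cluster allocation explain API, called with the index and shard from the table above (host/port assumed):

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '{
  "index": "index_expedients",
  "shard": 2,
  "primary": true
}'

The retry that the max_retry decider mentions would be the call below, though presumably it will just keep failing while the translog on disk stays truncated:

curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true'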

Unfortunately I don't have a snapshot to restore from. Is there a way to recover these shards?

Regards,

This error means that your disks were behaving incorrectly: they confirmed to Elasticsearch that some data had been written durably before it really had been, so that data was lost in the power outage. This is a pretty common trick on cheaper hardware since it makes the disks look much faster than they actually are, although there's sometimes a way to configure them to behave properly. The reference manual has some notes on troubleshooting disk issues - in particular, diskchecker.pl should be able to reproduce the same problem, which will help you explore the possible fixes.
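
In case it's useful, the diskchecker.pl procedure is roughly as follows (hostnames, paths and the file size are placeholders); the important part is cutting the power to the machine under test while the create step is running, then verifying once it's back up:

# on a second machine that stays powered on:
./diskchecker.pl -l

# on the machine whose disk you want to test, writing to that disk:
./diskchecker.pl -s other-host create /path/on/tested/disk/test_file 500

# pull the power on the test machine mid-run, boot it back up, then:
./diskchecker.pl -s other-host verify /path/on/tested/disk/test_file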

Unfortunately it means that the data Elasticsearch wrote simply never made it to the disk; it's gone, and there's no way to recover the missing portions.

The best remedy is to restore from a snapshot. If your data matters to you, you really should be taking snapshots!
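
For completeness, a minimal snapshot setup on 6.x is just two API calls; the repository name and location below are made up, and the location must be listed under path.repo in elasticsearch.yml:

# register a shared-filesystem snapshot repository
curl -s -X PUT -H 'Content-Type: application/json' 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'

# take a snapshot of all indices and wait for it to finish
curl -s -X PUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'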

You may be able to use the elasticsearch-shard tool to recover some of the remaining data, although there's no guarantee how much might be lost in the process.
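
On versions that include the tool, the sequence is roughly this (index and shard taken from your output above; the exact reroute command to run afterwards is printed by the tool itself). Stop the node, then on that node run:

bin/elasticsearch-shard remove-corrupted-data --index index_expedients --shard-id 2

After restarting the node, tell the cluster to accept the (possibly lossy) surviving copy with something like:

curl -s -X POST -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/reroute' -d '{
  "commands": [ {
    "allocate_stale_primary": {
      "index": "index_expedients",
      "shard": 2,
      "node": "SPANBIWEB02",
      "accept_data_loss": true
    }
  } ]
}'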

Thank you David. I'm not able to find the elasticsearch-shard tool; I'm using version 6.3.0.
@DavidTurner Do you know if there's a way to get the elasticsearch-shard tool for version 6.3.0?

No, sorry, I don't remember if there's a way to recover anything from this situation in such old versions.

6.3 is super old and EOL; you should really look to upgrade ASAP.
