ES 6.2.4 - ALLOCATION_FAILED TranslogCorruptedException

Hi Elastic Team,

There's a single unassigned shard in the cluster due to an ALLOCATION_FAILED error, which I can't figure out how to resolve.

I've already retried the allocation several times using curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty' and also restarted the node that holds the problematic shard. Here is the relevant output of GET /_cluster/allocation/explain?pretty:

  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-07-30T11:28:21.654Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [_wtZ2Gq7TvWoKsII2yCN6Q]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[operation size must be at least 4 but was: 0]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",

...
other output
...

{
  "node_id" : "_wtZ2Gq7TvWoKsII2yCN6Q",
  "node_name" : "the_problematic_node",
  "transport_address" : "171.134.100.215:9301",
  "node_attributes" : {
    "ml.machine_memory" : "270831382528",
    "ml.max_open_jobs" : "20",
    "ml.enabled" : "true",
    "node.type" : "hot"
  },
  "node_decision" : "no",
  "store" : {
    "in_sync" : true,
    "allocation_id" : "2ujCGOK-Swesr0sigjdeGw"
  },
  "deciders" : [
    {
      "decider" : "max_retry",
      "decision" : "NO",
      "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-07-30T11:28:21.654Z], failed_attempts[5], delayed=false, details[failed shard on node [_wtZ2Gq7TvWoKsII2yCN6Q]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[operation size must be at least 4 but was: 0]; ], allocation_status[deciders_no]]]"
    }
  ]
},
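For anyone who wants to script this check, the key fields of the explain output above can be summarized programmatically. This is only a minimal Python sketch, not an official tool: the function name is hypothetical, and it assumes the per-node objects are wrapped in a node_allocation_decisions array (that part of the output is elided above).

```python
def summarize_allocation_explain(explain):
    """Collect the failure reason and every NO decider from an allocation-explain response."""
    info = explain.get("unassigned_info", {})
    blockers = []
    # Assumption: per-node decisions live under "node_allocation_decisions".
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blockers.append((node.get("node_name"), decider.get("decider")))
    return {
        "reason": info.get("reason"),
        "failed_attempts": info.get("failed_allocation_attempts"),
        # The nested exception name only appears inside the free-text details.
        "translog_corrupted": "TranslogCorruptedException" in info.get("details", ""),
        "blockers": blockers,
    }


# Trimmed copy of the response shown above.
sample = {
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "failed_allocation_attempts": 5,
        "details": "failed shard on node [_wtZ2Gq7TvWoKsII2yCN6Q]: shard failure, "
                   "reason [failed to recover from translog], failure "
                   "EngineException[failed to recover from translog]; nested: "
                   "TranslogCorruptedException[operation size must be at least 4 but was: 0]; ",
    },
    "can_allocate": "no",
    "node_allocation_decisions": [
        {
            "node_name": "the_problematic_node",
            "node_decision": "no",
            "deciders": [{"decider": "max_retry", "decision": "NO"}],
        }
    ],
}

print(summarize_allocation_explain(sample))
```

In this case the summary points at two separate things: the max_retry decider, which retry_failed=true resets, and the translog corruption, which a retry alone does not fix.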

Is there a way to recover this shard?

Many thanks,

After doing some further research into this, I think I've got myself a scrumptious translog corruption scenario here :scream:. It's probably worth mentioning that the corruption most likely happened because our hosts were abruptly rebooted that day, and possibly also because we were not running the recommended version of Java (1.8 u131); we were on a lower update version.

The problematic index's template is configured to have only two primary shards (no replicas). The unassigned shard is one of the two primary shards:

GET /_cat/shards/?h=index,shard,prirep,state,unassigned.reason

my_problematic_index                  1 p UNASSIGNED ALLOCATION_FAILED
my_problematic_index                  0 p STARTED
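If there are many indices, the same listing can be filtered down to unassigned primaries. The following is a rough Python sketch with a hypothetical helper name; it assumes the column order requested above (index, shard, prirep, state, unassigned.reason) and that the last column is empty for started shards.

```python
def unassigned_primaries(cat_output):
    """Return (index, shard, reason) for each unassigned primary in _cat/shards text output."""
    found = []
    for line in cat_output.splitlines():
        cols = line.split()
        # Expected columns: index, shard, prirep, state, [unassigned.reason]
        if len(cols) < 4 or not cols[1].isdigit():
            continue  # skip blank lines and a header row, if one was requested
        index, shard, prirep, state = cols[:4]
        reason = cols[4] if len(cols) > 4 else None
        if prirep == "p" and state == "UNASSIGNED":
            found.append((index, int(shard), reason))
    return found


listing = """\
my_problematic_index                  1 p UNASSIGNED ALLOCATION_FAILED
my_problematic_index                  0 p STARTED
"""

for index, shard, reason in unassigned_primaries(listing):
    print(index, shard, reason)
```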

Therefore, I accepted fate and decided to "exterminate" shard 1 with data loss (at least I still have half of the data in that index), following the steps in: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html#_forced_allocation_on_unrecoverable_errors

curl -X POST "localhost:9200/_cluster/reroute?pretty" -H 'Content-Type: application/json' -d'
{
        "commands": [{
                "allocate_empty_primary": {
                        "index": "my_problematic_index",
                        "shard": 1,
                        "node": "some_new_target_node_name",
                        "accept_data_loss": true
                }
        }]
}'

P.S.: allocate_stale_primary did not work, so I used allocate_empty_primary instead.
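For reference, the same reroute request can be built from a script. This is a minimal Python sketch under a few assumptions: the helper names are hypothetical, the cluster is reachable at localhost:9200 without authentication (as in the curl command above), and only the standard library is used. Building the request is kept separate from sending it, because the command discards whatever data the shard held.

```python
import json
import urllib.request


def empty_primary_command(index, shard, node):
    """Build the reroute body that allocates an EMPTY primary (all data in that shard is lost)."""
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


def build_reroute_request(body, base_url="http://localhost:9200"):
    """Wrap the body in a POST request to /_cluster/reroute without sending it."""
    return urllib.request.Request(
        base_url + "/_cluster/reroute?pretty",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = empty_primary_command("my_problematic_index", 1, "some_new_target_node_name")
request = build_reroute_request(body)
print(json.dumps(body, indent=2))

# Sending is deliberately left as an explicit, separate step:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```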
