ES 6.2.4 - ALLOCATION_FAILED TranslogCorruptedException

ld_pvl · July 30, 2018, 1:12pm

Hi Elastic Team,

There's a single unassigned shard in the cluster due to an ALLOCATION_FAILED error, and I can't figure out how to resolve it.

I've already retried several times using curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty' and also bounced the node that holds the problematic shard. Here is the relevant output of GET /_cluster/allocation/explain?pretty:

  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-07-30T11:28:21.654Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [_wtZ2Gq7TvWoKsII2yCN6Q]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[operation size must be at least 4 but was: 0]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",

...
other output
...

{
  "node_id" : "_wtZ2Gq7TvWoKsII2yCN6Q",
  "node_name" : "the_problematic_node",
  "transport_address" : "171.134.100.215:9301",
  "node_attributes" : {
    "ml.machine_memory" : "270831382528",
    "ml.max_open_jobs" : "20",
    "ml.enabled" : "true",
    "node.type" : "hot"
  },
  "node_decision" : "no",
  "store" : {
    "in_sync" : true,
    "allocation_id" : "2ujCGOK-Swesr0sigjdeGw"
  },
  "deciders" : [
    {
      "decider" : "max_retry",
      "decision" : "NO",
      "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-07-30T11:28:21.654Z], failed_attempts[5], delayed=false, details[failed shard on node [_wtZ2Gq7TvWoKsII2yCN6Q]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[operation size must be at least 4 but was: 0]; ], allocation_status[deciders_no]]]"
    }
  ]
},

Is there a way to recover this shard?

Many thanks,
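
For anyone reproducing the diagnosis above: the allocation explain API also accepts a request body naming a specific shard, which avoids relying on whichever unassigned shard the cluster picks first. A minimal sketch follows, assuming the localhost:9200 endpoint from the post. The explain_body helper is a hypothetical name introduced here only for illustration.

```shell
# Hypothetical helper: build the request body for the allocation explain API.
# $1 = index name, $2 = shard number. "primary" is fixed to true because the
# shard in question is a primary.
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":true}\n' "$1" "$2"
}

# Dry run: print the body that would be sent.
explain_body my_problematic_index 1

# Against a live cluster (endpoint assumed from the post):
# curl -s -XGET 'localhost:9200/_cluster/allocation/explain?pretty' \
#   -H 'Content-Type: application/json' \
#   -d "$(explain_body my_problematic_index 1)"
```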

ld_pvl · August 3, 2018, 1:35pm

After doing some further research into this, I think I've got myself a scrumptious translog corruption scenario here. Probably worth mentioning: the corruption most likely happened because our hosts were abruptly rebooted on that day, plus maybe the fact that we were not using the recommended version of Java (1.8 u131) - we were using a lower update version.

The problematic index's template is configured to have only two primary shards (no replicas). The unassigned shard is one of the two primary shards:

GET /_cat/shards/?h=index,shard,prirep,state,unassigned.reason

my_problematic_index                  1 p UNASSIGNED ALLOCATION_FAILED
my_problematic_index                  0 p STARTED
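
On a cluster with many indices, the same _cat/shards output can be narrowed to the unassigned shards only. The sketch below assumes the five-column layout requested above (index, shard, prirep, state, unassigned.reason) and feeds in the two lines from this post as sample input. The list_unassigned function name is hypothetical.

```shell
# Hypothetical filter: keep only rows whose 4th column (state) is UNASSIGNED
# and print index, shard, prirep and the unassigned reason.
list_unassigned() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Sample input copied from the post. On a live cluster, pipe in instead:
# curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'
list_unassigned <<'EOF'
my_problematic_index                  1 p UNASSIGNED ALLOCATION_FAILED
my_problematic_index                  0 p STARTED
EOF
```

With no replica to fall back on, an unassigned primary like this one leaves few options.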

Therefore, I accepted fate and decided to "exterminate" shard 1 with data loss (at least I still have half of the data in that index), following the steps described at: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html#_forced_allocation_on_unrecoverable_errors

curl -X POST "localhost:9200/_cluster/reroute?pretty" -H 'Content-Type: application/json' -d'
{
        "commands": [{
                "allocate_empty_primary": {
                        "index": "my_problematic_index",
                        "shard": 1,
                        "node": "some_new_target_node_name",
                        "accept_data_loss": true
                }
        }]
}'

P.S.: allocate_stale_primary did not work, so I used allocate_empty_primary instead.
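
Since allocate_stale_primary and allocate_empty_primary take the same parameters, the two attempts described above can share one request-body builder. This is only a sketch under the assumptions of this thread (localhost:9200, the index and node names used above). The reroute_body helper is a hypothetical name, and both commands discard data, so treat it as illustration rather than a recommended procedure.

```shell
# Hypothetical helper: build a cluster reroute body for a forced primary allocation.
# $1 = command (allocate_stale_primary or allocate_empty_primary)
# $2 = index name, $3 = shard number, $4 = target node name
reroute_body() {
  printf '{"commands":[{"%s":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' \
    "$1" "$2" "$3" "$4"
}

# Dry run: print the body for the empty-primary case from the post.
reroute_body allocate_empty_primary my_problematic_index 1 some_new_target_node_name

# Against a live cluster (destructive, shown commented out on purpose):
# curl -s -XPOST 'localhost:9200/_cluster/reroute?pretty' \
#   -H 'Content-Type: application/json' \
#   -d "$(reroute_body allocate_empty_primary my_problematic_index 1 some_new_target_node_name)"
#
# Afterwards, check whether the index has left the red state:
# curl -s 'localhost:9200/_cluster/health/my_problematic_index?pretty'
```

Appending dry_run=true to the reroute URL asks the cluster to simulate the command and return the resulting state without applying it, which may be worth doing before accepting data loss.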

