Failed shard after OOMing, corrupt index

My server took on too much load and crashed ES. After I restarted it, my index showed up as corrupt. Here is the output of the shard allocation explain API:

curl -XGET localhost:9200/_cluster/allocation/explain?pretty        
{
  "index" : "twitter",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-11-06T06:11:15.562Z",
 "failed_allocation_attempts" : 5,                                                                       [0/819]
    "details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[t
witter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXMKnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300
}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[fai
led to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerL
ength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/trans
log/translog-1228.ckp\"))]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in
-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "CxXWE8BiQbS4ThB9AvvGQA",
      "node_name" : "node-1",
      "transport_address" : "10.142.0.2:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "gxegPAMyQa21MH5NxQEACw"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - man
ually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-11-
06T06:11:15.562Z], failed_attempts[5], delayed=false, details[failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed
 recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{yYDvXM
KnS9KhaIlzPEsJNg}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway
]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec f
ooter (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/n
odes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}

I ran the following command to try to fix the index, but it ended up saying that everything was fine:

sudo java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex "[NODE]/0/[INDEX]/0/index/" -exorcise

Took 1972.855 sec total.

Is there anything I can do to fix the index?

The corruption is in the translog, rather than in the Lucene index; CheckIndex only checks the Lucene index so it will not notice this.
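
You can double-check this against the exception itself: it says length=0 but footerLength==16, i.e. the checkpoint file on disk is empty, and a valid checkpoint file is never empty. For example (a sketch, using the path from the error message):

# expect the size column to show 0 bytes, matching length=0 in the exception
sudo ls -l /var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp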

I'm not sure how Elasticsearch got into this state, however. What version are you using? Could you share the stack trace from a failed allocation, which will be in the log file? You can retry, in order to trigger the same exception, using:

POST /_cluster/reroute?retry_failed=true
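
That's in Console syntax; from the command line the curl equivalent is along these lines:

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'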

Could you also zip up all the files matching the following pattern and share them here?

/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/*.ckp
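
Something along these lines would collect them (a sketch; the archive name and the /tmp destination are just examples):

# run the glob as root so it can read the elasticsearch data directory
sudo sh -c 'cd /var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog && zip /tmp/twitter-translog-ckp.zip *.ckp'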

Hi David, thanks for the response. I'm running ES 6.2.2.

My logging isn't working right now for some reason, but in the meantime I'll add the relevant bits from the output of the reroute attempt. I'm not sure what the preferred method of sharing the zip is, so I added it to this repo: https://github.com/jtalmi/twitter/blob/master/es_broken_index.zip

Let me know if that's ok.

 "twitter" : {
          "shards" : {
            "0" : [
              {
                "state" : "UNASSIGNED",
                "primary" : true,
                "node" : null,
                "relocating_node" : null,
                "shard" : 0,
                "index" : "twitter",
                "recovery_source" : {
                  "type" : "EXISTING_STORE"
                },
                "unassigned_info" : {
                  "reason" : "ALLOCATION_FAILED",
                  "at" : "2018-11-09T06:56:33.069Z",
                  "failed_attempts" : 5,
                  "delayed" : false,
                  "details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{Wh32CTV_QWiZkFFB06YEsw}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ",
                  "allocation_status" : "deciders_no"
                }
              }
            ]
          }
        },

and

    "routing_nodes" : {
      "unassigned" : [
        {
          "state" : "UNASSIGNED",
          "primary" : true,
          "node" : null,
          "relocating_node" : null,
          "shard" : 0,
          "index" : "twitter",
          "recovery_source" : {
            "type" : "EXISTING_STORE"
          },
          "unassigned_info" : {
            "reason" : "ALLOCATION_FAILED",
            "at" : "2018-11-09T06:56:33.069Z",
            "failed_attempts" : 5,
            "delayed" : false,
            "details" : "failed shard on node [CxXWE8BiQbS4ThB9AvvGQA]: failed recovery, failure RecoveryFailedException[[twitter][0]: Recovery failed on {node-1}{CxXWE8BiQbS4ThB9AvvGQA}{Wh32CTV_QWiZkFFB06YEsw}{10.142.0.2}{10.142.0.2:9300}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: CorruptIndexException[misplaced codec footer (file truncated?): length=0 but footerLength==16 (resource=SimpleFSIndexInput(path=\"/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp\"))]; ",
            "allocation_status" : "deciders_no"
          }
        }
      ],

This looks like a bug and I opened #35407 to prevent it happening in future.

To recover in your case I think it will be sufficient to delete the empty checkpoint file that was partly created when the OOM occurred:

/var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp

Then retry the allocation one more time:

POST /_cluster/reroute?retry_failed=true
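
In shell terms that would be roughly the following (a sketch; the backup destination is only an example, and keeping a copy of the file before removing it is just a precaution):

# keep a copy of the empty checkpoint file, then remove it from the translog directory
sudo cp /var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp /tmp/translog-1228.ckp.bak
sudo rm /var/lib/elasticsearch/nodes/0/indices/l1VcSQySRmuyFGTBBPjX9g/0/translog/translog-1228.ckp

# then retry the allocation
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'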

That worked! Thanks a lot. I'll do some reading on translogs to understand the issue a bit more.

The translog section of the 2.x Definitive Guide is a good overview if you are interested, although the actual mechanics of the translog aren't particularly important here.

The particular problem is that we copied translog.ckp to translog-1228.ckp, which is a two-step process: create the file, then write its contents. The OOM happened right in the middle of this, after the file was created but before its contents were written, leaving an empty file on disk. We weren't expecting to hit an OOM in that state, so on startup we treated it as a corrupt shard and refused to start it.

The change in #35407 is to copy translog.ckp to a temporary file and then atomically rename it to translog-1228.ckp, which avoids the problematic intermediate state. Deleting the empty file also solved your problem because it took the system back to a state in which a crash is expected and handled, namely the state before the file copy started.
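
The same idea in shell terms, as a rough analogy rather than the actual Java code:

# non-atomic copy: a crash after the destination file is created but before its
# contents are written leaves an empty translog-1228.ckp behind, which is
# roughly what happened here
cp translog.ckp translog-1228.ckp

# the pattern described for #35407: write to a temporary file, then rename it
# into place; a rename within one filesystem is atomic, so translog-1228.ckp is
# either absent or complete, never empty
cp translog.ckp translog-1228.ckp.tmp
mv translog-1228.ckp.tmp translog-1228.ckp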
