Exception while restoring a snapshot: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)

Hi everyone,

I got an exception while restoring a snapshot in ES 5.2.

The snapshot was created and restored following the Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/5.2/modules-snapshots.html

The snapshot was created on an Ubuntu machine running ES 5.1. After that, I copied the "path.repo" contents (almost 26 GB) to a "Mac OS Extended (Journaled)" external HD so that I could restore it on macOS Sierra (10.12.3).
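
In case the copy itself is the problem, this is roughly how I could compare checksums of the repository contents on both machines (just a sketch; the paths are placeholders, and it assumes sha1sum on Ubuntu and shasum on the Mac):

# On the Ubuntu machine, inside the snapshot repository (placeholder path):
cd /path/to/path.repo
find . -type f -exec sha1sum {} \; | sort -k2 > /tmp/repo_sha1_ubuntu.txt

# On the Mac, inside the copy on the external HD (placeholder path):
cd /Volumes/ExternalHD/path.repo
find . -type f -exec shasum -a 1 {} \; | sort -k2 > /tmp/repo_sha1_mac.txt

# After putting the two lists side by side, any line in the diff output is a
# file that changed during the copy:
diff /tmp/repo_sha1_ubuntu.txt /tmp/repo_sha1_mac.txt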

On the Mac, I:
(1) Copied the snapshot contents into the corresponding ES 5.2 "path.repo" directory;
(2) Restarted ES 5.2;
(3) Ran PUT localhost:9200/_snapshot/elastic_backup with the body:

{
    "type": "fs",
    "settings": {
        "location": <path.repo>,
        "compress": true
    }
}

(4) Ran GET localhost:9200/_snapshot/elastic_backup/2017-02-13/ to verify that the snapshot was available (two extra repository checks are sketched right after these steps);
(5) Ran POST localhost:9200/_snapshot/elastic_backup/2017-02-13/_restore with the body:

{
	"ignore_unavailable": true
}

and then the restore process started.

(6) Ran GET localhost:9200/delfos_index_homologacao/_recovery?human to monitor the progress of the restore.
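
As the extra checks referenced in step (4), this is a minimal curl sketch I could run against the repository registration, using the same repository name as above:

# Check that the node can actually read and write the registered repository:
curl -XPOST 'localhost:9200/_snapshot/elastic_backup/_verify?pretty'

# List the snapshots the repository actually contains:
curl -XGET 'localhost:9200/_snapshot/elastic_backup/_all?pretty'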

Everything seemed to be working fine until I got the following exception (I've selected only parts of the stack trace):

[2017-02-15T13:36:09,946][WARN ][o.e.i.c.IndicesClusterStateService] [Qeg0vLX] [[delfos_index_homologacao][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1509) ~[elasticsearch-5.2.0.jar:5.2.0]

Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vq83pj actual=1p32p42 (resource=name [_1tmh_Lucene50_0.pay], length [529625211], checksum [1vq83pj], writtenBy [6.3.0]) (resource=VerifyingIndexOutput(_1tmh_Lucene50_0.pay))
	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1149) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1128) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1157) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restoreFile(BlobStoreRepository.java:1727) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1665) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:980) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1241) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1505) ~[elasticsearch-5.2.0.jar:5.2.0]
	... 4 more
[2017-02-15T13:36:11,446][WARN ][o.e.c.a.s.ShardStateAction] [Qeg0vLX] [delfos_index_homologacao][2] received shard failed for shard id [[delfos_index_homologacao][2]], allocation id [pYmRnBTxSUeTefMR9d_jpA], primary term [0], message [failed recovery], failure [RecoveryFailedException[[delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [2017-02-13/F_Q8708SSqC9yHJZ_xNefw]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1vq83pj actual=1p32p42 (resource=name [_1tmh_Lucene50_0.pay], length [529625211], checksum [1vq83pj], writtenBy [6.3.0]) (resource=VerifyingIndexOutput(_1tmh_Lucene50_0.pay))]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1509) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_66]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_66]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_66]

Some considerations:
1 - The index is composed of 5 shards, but GET localhost:9200/delfos_index_homologacao/_recovery?human only shows 4 (see the shard-state checks sketched after this list).
2 - I set -Des.max-open-files=true in the JVM options, in case the problem was related to file descriptors, but the error persisted.
3 - The only property set in elasticsearch.yml is path.repo, which points to the directory where I copied the snapshot data.
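
For consideration 1, these are the requests I could use to look at the state of all 5 shards (a rough sketch; the allocation-explain call simply picks the first unassigned shard it finds):

# List every shard of the index and its current state (should show 5 primaries):
curl -XGET 'localhost:9200/_cat/shards/delfos_index_homologacao?v'

# Explain why an unassigned shard is not being allocated:
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'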

Questions:
(1) Could this be related to the different file systems on which the snapshot was created and restored?
(2) Why are only 4 shards returned by GET _recovery?
(3) Is it possible that the snapshot was corrupted during its creation or during the copy to the external HD?
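
Regarding question (3), these are the checks I could run against the original Ubuntu cluster (a sketch, assuming that cluster is still reachable and the repository is still registered there) to see whether the snapshot itself completed cleanly:

# High-level snapshot info, including its state (SUCCESS / PARTIAL / FAILED):
curl -XGET 'localhost:9200/_snapshot/elastic_backup/2017-02-13?pretty'

# Per-shard statistics for the snapshot:
curl -XGET 'localhost:9200/_snapshot/elastic_backup/2017-02-13/_status?pretty'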

Thanks in advance,

Guilherme
