Exception when restoring a backup: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)


(Guilherme Maranhao) #1

Hi everyone,

I got an exception while restoring a snapshot in ES 5.2.

Both the snapshot creation and the restore followed the Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/5.2/modules-snapshots.html

The snapshot was created on an Ubuntu machine running ES 5.1. After that, I copied the "path.repo" contents (almost 26 GB) to a Mac OS Extended (Journaled) external HD so that I could restore it on macOS Sierra (10.12.3).
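For reference, this is roughly how the snapshot was created on the Ubuntu/ES 5.1 machine, following the documentation above (the repository location below is just a placeholder; only the snapshot name, 2017-02-13, is the real one):

# register the filesystem repository pointing at path.repo on the Ubuntu machine
curl -XPUT 'localhost:9200/_snapshot/elastic_backup' -d '{
    "type": "fs",
    "settings": {
        "location": "/path/to/path.repo",
        "compress": true
    }
}'

# create the snapshot and wait for it to finish
curl -XPUT 'localhost:9200/_snapshot/elastic_backup/2017-02-13?wait_for_completion=true'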

On the Mac, I did the following (the equivalent curl commands are collected after step 6):
(1) Copied the snapshot contents to the ES 5.2 "path.repo" directory;
(2) Restarted ES 5.2;
(3) Ran PUT localhost:9200/_snapshot/elastic_backup

{
    "type": "fs",
    "settings": {
        "location": <path.repo>,
        "compress": true
    }
}

(4) Ran GET localhost:9200/_snapshot/elastic_backup/2017-02-13/ to verify that the snapshot was available;
(5) Ran POST localhost:9200/_snapshot/elastic_backup/2017-02-13/_restore

{
	"ignore_unavailable": true
}

which started the restore process.

(6) Ran GET localhost:9200/delfos_index_homologacao/_recovery?human to monitor the restore progress.
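In curl form, steps (3) to (6) correspond to roughly the following (<path.repo> stands for the snapshot directory on the external HD):

# (3) register the filesystem repository pointing at the copied snapshot data
curl -XPUT 'localhost:9200/_snapshot/elastic_backup' -d '{
    "type": "fs",
    "settings": {
        "location": "<path.repo>",
        "compress": true
    }
}'

# (4) check that the snapshot is listed in the repository
curl -XGET 'localhost:9200/_snapshot/elastic_backup/2017-02-13/'

# (5) start the restore
curl -XPOST 'localhost:9200/_snapshot/elastic_backup/2017-02-13/_restore' -d '{
    "ignore_unavailable": true
}'

# (6) follow the per-shard recovery progress
curl -XGET 'localhost:9200/delfos_index_homologacao/_recovery?human'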

Everything seemed to be working fine until I got this exception (only the relevant parts of the stack trace are shown):

[2017-02-15T13:36:09,946][WARN ][o.e.i.c.IndicesClusterStateService] [Qeg0vLX] [[delfos_index_homologacao][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1509) ~[elasticsearch-5.2.0.jar:5.2.0]

Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vq83pj actual=1p32p42 (resource=name [_1tmh_Lucene50_0.pay], length [529625211], checksum [1vq83pj], writtenBy [6.3.0]) (resource=VerifyingIndexOutput(_1tmh_Lucene50_0.pay))
	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1149) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1128) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1157) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restoreFile(BlobStoreRepository.java:1727) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1665) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:980) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1241) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1505) ~[elasticsearch-5.2.0.jar:5.2.0]
	... 4 more
[2017-02-15T13:36:11,446][WARN ][o.e.c.a.s.ShardStateAction] [Qeg0vLX] [delfos_index_homologacao][2] received shard failed for shard id [[delfos_index_homologacao][2]], allocation id [pYmRnBTxSUeTefMR9d_jpA], primary term [0], message [failed recovery], failure [RecoveryFailedException[[delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [2017-02-13/F_Q8708SSqC9yHJZ_xNefw]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1vq83pj actual=1p32p42 (resource=name [_1tmh_Lucene50_0.pay], length [529625211], checksum [1vq83pj], writtenBy [6.3.0]) (resource=VerifyingIndexOutput(_1tmh_Lucene50_0.pay))]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1509) ~[elasticsearch-5.2.0.jar:5.2.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_66]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_66]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_66]

Some considerations:
1 - The index has 5 shards, but GET localhost:9200/delfos_index_homologacao/_recovery?human shows only 4 of them.
2 - I set -Des.max-open-files=true in jvm.options, in case the problem was related to file descriptors, but the error persisted.
3 - The only property set in elasticsearch.yml is path.repo, which points to the directory where I copied the snapshot data (see the snippet below).
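For completeness, that elasticsearch.yml line looks like this (the path shown here is illustrative; mine points at the directory on the external HD):

path.repo: ["/Volumes/ExternalHD/elastic_backup"]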

Questions:
(1) Does anybody know whether this could be related to the different file systems the snapshot was created on and restored from?
(2) Why does GET _recovery return only 4 shards?
(3) Is it possible that the snapshot was corrupted during its creation or during the copy to the external HD? (A possible way to check this is sketched below.)
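Regarding question (3), one way I can think of to rule out a bad copy would be to checksum the repository files on both machines and compare the results, roughly like this (paths are illustrative):

# on the Ubuntu machine, inside the original path.repo directory
cd /path/to/original/path.repo
find . -type f -exec shasum {} \; | sort > /tmp/checksums_ubuntu.txt

# on the Mac, inside the copied directory on the external HD
cd /Volumes/ExternalHD/elastic_backup
find . -type f -exec shasum {} \; | sort > /tmp/checksums_mac.txt

# any output here means at least one file differs between the two copies
diff /tmp/checksums_ubuntu.txt /tmp/checksums_mac.txt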

Thanks in advance,

Guilherme


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.