Hi everyone,
I got an exception while restoring a snapshot in ES 5.2.
The snapshot was created and restored following the Elasticsearch recommendations at https://www.elastic.co/guide/en/elasticsearch/reference/5.2/modules-snapshots.html
The snapshot was created on an Ubuntu machine running ES 5.1. Afterwards, I copied the contents of "path.repo" (almost 26 GB) to a "Mac OS Extended (Journaled)" external HD so that I could restore it on macOS Sierra (10.12.3).
On the Mac,
(1) I copied the snapshot contents to the ES 5.2 "path.repo" directory;
(2) Restarted ES 5.2;
(3) Ran PUT localhost:9200/_snapshot/elastic_backup
{
  "type": "fs",
  "settings": {
    "location": <path.repo>,
    "compress": true
  }
}
(4) Ran GET localhost:9200/_snapshot/elastic_backup/2017-02-13/ to verify that the snapshot was available;
(5) Ran POST localhost:9200/_snapshot/elastic_backup/2017-02-13/_restore
{
  "ignore_unavailable": true
}
which started the restore process;
(6) Ran GET localhost:9200/delfos_index_homologacao/_recovery?human to monitor the restore.
Everything seemed to be working fine until I got the following exception (I've selected just some parts of the stack trace):
[2017-02-15T13:36:09,946][WARN ][o.e.i.c.IndicesClusterStateService] [Qeg0vLX] [[delfos_index_homologacao][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1509) ~[elasticsearch-5.2.0.jar:5.2.0]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vq83pj actual=1p32p42 (resource=name [_1tmh_Lucene50_0.pay], length [529625211], checksum [1vq83pj], writtenBy [6.3.0]) (resource=VerifyingIndexOutput(_1tmh_Lucene50_0.pay))
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1149) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1128) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1157) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restoreFile(BlobStoreRepository.java:1727) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$RestoreContext.restore(BlobStoreRepository.java:1665) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:980) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:400) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$4(StoreRecovery.java:234) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:232) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1241) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1505) ~[elasticsearch-5.2.0.jar:5.2.0]
... 4 more
[2017-02-15T13:36:11,446][WARN ][o.e.c.a.s.ShardStateAction] [Qeg0vLX] [delfos_index_homologacao][2] received shard failed for shard id [[delfos_index_homologacao][2]], allocation id [pYmRnBTxSUeTefMR9d_jpA], primary term [0], message [failed recovery], failure [RecoveryFailedException[[delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [2017-02-13/F_Q8708SSqC9yHJZ_xNefw]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1vq83pj actual=1p32p42 (resource=name [_1tmh_Lucene50_0.pay], length [529625211], checksum [1vq83pj], writtenBy [6.3.0]) (resource=VerifyingIndexOutput(_1tmh_Lucene50_0.pay))]; ]
org.elasticsearch.indices.recovery.RecoveryFailedException: [delfos_index_homologacao][2]: Recovery failed on {Qeg0vLX}{Qeg0vLXlRFiwrO3vu1ul1w}{nXA02NvZQCSjB38jq7MvkA}{127.0.0.1}{127.0.0.1:9300}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1509) ~[elasticsearch-5.2.0.jar:5.2.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.0.jar:5.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_66]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_66]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_66]
Some considerations:
1 - The index is composed of 5 shards, but GET localhost:9200/delfos_index_homologacao/_recovery?human only shows 4 of them.
2 - I set -Des.max-open-files=true in jvm.options, in case the problem was related to file descriptors, but the error persisted.
3 - The only property set in elasticsearch.yml is path.repo, which points to the directory where I copied the snapshot data.
Questions:
(1) Does anybody know if this could be related to the snapshot being created and restored on different file systems?
(2) Why are only 4 shards returned by GET _recovery?
(3) Is it possible that the snapshot was corrupted during its creation, or during its copy to the external HD?
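Regarding question (3), one way I could rule out corruption during the copy is to compare per-file checksums between the original repository and the copy on the external HD. A minimal sketch (assuming Python 3; the two directory paths are placeholders for the original "path.repo" and the copy):

```python
import hashlib
import os

def dir_checksums(root):
    """Map each file's path (relative to root) to its SHA-1 hex digest."""
    sums = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            h = hashlib.sha1()
            # Hash in 1 MiB chunks so large segment files don't load into memory at once.
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sums[rel] = h.hexdigest()
    return sums

def compare_repos(src, dst):
    """Return the relative paths that are missing from one side or whose contents differ."""
    a, b = dir_checksums(src), dir_checksums(dst)
    return {rel for rel in a.keys() | b.keys() if a.get(rel) != b.get(rel)}
```

If compare_repos(original_repo, copied_repo) returns an empty set, the copy is byte-identical to the original, so any corruption would have happened before or during the snapshot's creation rather than in the transfer.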
Thanks in advance,
Guilherme