Hi Team,
We are trying to restore a backup, taken a couple of days ago, onto a freshly installed Elasticsearch cluster. The steps we followed are:
1) Installed an Elasticsearch chart and pushed some data
2) Took a backup after a couple of days
3) Deleted the chart, and also deleted the volumes
4) Re-installed the chart
5) Tried to restore the backup tar generated in step 2
Here, when we try to restore the tar, the restore command fails: our post-restore script checks .snapshot.shards.failed in the restore API response, and if any shard has failed, the restore is treated as failed.
We notice the following errors in the es-master pod's log:
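For context, here is a minimal sketch (not our actual script) of the kind of check the post-restore script performs. The function name and the sample response below are illustrative; the response shape mirrors what the Elasticsearch restore API returns when called with wait_for_completion=true:

```python
def restore_failed(restore_response: dict) -> bool:
    """Return True if the restore API response reports any failed shards.

    Inspects .snapshot.shards.failed in the response body, i.e. the same
    field our post-restore script checks.
    """
    shards = restore_response.get("snapshot", {}).get("shards", {})
    return shards.get("failed", 0) > 0

# Example: a response where one shard failed, as in the log below where a
# restored shard hit a CorruptIndexException and was marked failed.
resp = {"snapshot": {"shards": {"total": 5, "successful": 4, "failed": 1}}}
print(restore_failed(resp))  # True -> the script fails the restore
```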
{"type":"log","host":"elasticsearch-master-0","level":"WARN","systemid":"xx","system":"xxxxx","time": "2021-01-11T12:57:35.832Z","logger":"o.e.c.r.a.AllocationService","timezone":"UTC","marker":"[elasticsearch-master-0] ","log":
{"message":"failing shard [failed shard, shard [log-default-2021.01.06-restored_1610369773][0], node[rtI9WKPhQKC2kGYIYe7VVA], [P], recovery_source[snapshot recovery [eZWe2MSCQYioReM-353mZg] from es_backup:es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ], s[INITIALIZING], a[id=HBxXRVRjT9mrjSSxehGwiw], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2021-01-11T12:56:16.135Z], delayed=false, details[restore_source[es_backup/es-snapshot-2021.01.11-09:56:30]], allocation_status[deciders_throttled]], message [failed recovery], failure [RecoveryFailedException[[log-default-2021.01.06-restored_1610369773][0]: Recovery failed on {elasticsearch-data-1}{rtI9WKPhQKC2kGYIYe7VVA}{1DoYGQPcSHOXUmG7v3rE8Q}{172.30.177.183}{172.30.177.183:9300}{d}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))]; ], markAsStale [true]]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [log-default-2021.01.06-restored_1610369773][0]: Recovery failed on {elasticsearch-data-1}{rtI9WKPhQKC2kGYIYe7VVA}{1DoYGQPcSHOXUmG7v3rE8Q}{172.30.177.183}{172.30.177.183:9300}{d}
at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:2644) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:362) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:484) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$59(BlobStoreRepository.java:1857) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:94) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:83) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$fileQueueListener$61(BlobStoreRepository.java:1941) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:94) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:683) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.8.0.jar:7.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
... 17 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
... 15 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ]
... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1197) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1175) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1205) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$7.restoreFile(BlobStoreRepository.java:1911) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$7.lambda$restoreFiles$1(BlobStoreRepository.java:1883) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]
We notice this issue only with this particular backup tar; other backups restore without any problem.
What could be the cause of this issue, and how can we overcome it?
Thanks in advance.
Best Regards,
Akshat