Restore of elasticsearch data fails with CorruptIndexException[checksum failed (hardware problem?)

Hi Team,

We are trying to restore a backup, taken a couple of days ago, onto a freshly installed Elasticsearch cluster. The steps we followed are:

1. Installed an Elasticsearch chart and pushed some data.

2. Took a backup after a couple of days.

3. Deleted the chart, along with the volumes.

4. Re-installed the chart.

5. Tried to restore the backup tar that was generated in step 2.

When we try to restore the tar, the restore command fails because of a check in our post-restore script: it reads `.snapshot.shards.failed` from the restore API response, and if any shard failed, it fails the restore.
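For clarity, the check our post-restore script performs is roughly equivalent to this sketch (the helper name is ours, and it assumes the restore was called with `wait_for_completion=true` so the response body contains the `snapshot.shards` summary):

```python
import json

def restore_failed(response_body: str) -> bool:
    """Return True if the restore API response reports any failed shards.

    Assumes the restore was called with wait_for_completion=true, so the
    body looks like {"snapshot": {"shards": {"failed": N, ...}, ...}}.
    """
    body = json.loads(response_body)
    shards = body.get("snapshot", {}).get("shards", {})
    return shards.get("failed", 0) > 0

# Illustrative response bodies (shape as returned by the restore API;
# values are made up):
ok = '{"snapshot": {"snapshot": "snap-1", "shards": {"total": 5, "failed": 0, "successful": 5}}}'
bad = '{"snapshot": {"snapshot": "snap-1", "shards": {"total": 5, "failed": 1, "successful": 4}}}'
```

In our actual script this same check is done with `jq` on the curl output, but the logic is identical.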

We notice the following errors in the es-master pod's log:

{"type":"log","host":"elasticsearch-master-0","level":"WARN","systemid":"xx","system":"xxxxx","time": "2021-01-11T12:57:35.832Z","logger":"o.e.c.r.a.AllocationService","timezone":"UTC","marker":"[elasticsearch-master-0] ","log":

{"message":"failing shard [failed shard, shard [log-default-2021.01.06-restored_1610369773][0], node[rtI9WKPhQKC2kGYIYe7VVA], [P], recovery_source[snapshot recovery [eZWe2MSCQYioReM-353mZg] from es_backup:es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ], s[INITIALIZING], a[id=HBxXRVRjT9mrjSSxehGwiw], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2021-01-11T12:56:16.135Z], delayed=false, details[restore_source[es_backup/es-snapshot-2021.01.11-09:56:30]], allocation_status[deciders_throttled]], message [failed recovery], failure [RecoveryFailedException[[log-default-2021.01.06-restored_1610369773][0]: Recovery failed on {elasticsearch-data-1}{rtI9WKPhQKC2kGYIYe7VVA}{1DoYGQPcSHOXUmG7v3rE8Q}{172.30.177.183}{172.30.177.183:9300}{d}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))]; ], markAsStale [true]]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [log-default-2021.01.06-restored_1610369773][0]: Recovery failed on {elasticsearch-data-1}{rtI9WKPhQKC2kGYIYe7VVA}{1DoYGQPcSHOXUmG7v3rE8Q}{172.30.177.183}{172.30.177.183:9300}{d}
at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:2644) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:362) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:484) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$59(BlobStoreRepository.java:1857) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:94) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:83) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$fileQueueListener$61(BlobStoreRepository.java:1941) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:94) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:683) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.8.0.jar:7.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
... 17 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
... 15 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ]
... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1197) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1175) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1205) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$7.restoreFile(BlobStoreRepository.java:1911) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$7.lambda$restoreFiles$1(BlobStoreRepository.java:1883) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]

We are only noticing this issue with this particular backup tar, not with any of our other backups.
What could be the cause of this issue, and how can we overcome it?

Thanks in advance.

Best Regards,
Akshat

CorruptIndexException[checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))]

Your backup is corrupt, in the sense that its contents are not what Elasticsearch originally wrote. As the message suggests, this is likely a hardware problem: maybe the storage medium, maybe RAM, maybe something else entirely.

@DavidTurner Thank you for your quick response.

We ran the backup and the restore in the same lab. The storage class and everything else at restore time are the same as when we took the backup.
How can the backup content change? What could be the possible reasons behind that? Is there any workaround that could help us restore this data?

Also, could you please point me to any Elasticsearch documentation that covers this topic?

These things happen; there are lots of possibilities. Wikipedia has an article that gives some more background on silent corruption on disks; a user here recently reported corruption due to bad RAM; we recently found a kernel bug that causes corruption; and I've also seen corruption introduced by bad or buggy storage controllers.


@DavidTurner Thank you for the update.
Any idea how we can recover from this error once encountered, or how we can prevent it from happening?

There's not much you can do to recover this specific data. You could try restoring it again in the hope that the corruption happened when reading data rather than when writing it, but I wouldn't be very hopeful of success there. If the data that's written is wrong then the right data is gone.
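Concretely, retrying means deleting the partially-restored index left behind by the failed attempt and then calling the restore API again. A rough sketch of the two calls involved, using the repository, snapshot, and index names from the log above (the helper is illustrative, not an official client; `indices` and `include_global_state` are standard restore API parameters):

```python
import json

def build_retry_requests(repo: str, snapshot: str, index: str):
    """Build the two HTTP requests for a restore retry:
    1. DELETE the partially-restored index from the failed attempt.
    2. POST a fresh restore of just that index from the snapshot.
    Returns a list of (method, path, body) tuples usable with any HTTP client.
    """
    delete_req = ("DELETE", f"/{index}", None)
    restore_body = json.dumps({
        "indices": index,
        "include_global_state": False,
    })
    restore_req = (
        "POST",
        f"/_snapshot/{repo}/{snapshot}/_restore?wait_for_completion=true",
        restore_body,
    )
    return [delete_req, restore_req]

# Identifiers taken from the log output in this thread:
reqs = build_retry_requests(
    "es_backup",
    "es-snapshot-2021.01.11-09:56:30",
    "log-default-2021.01.06-restored_1610369773",
)
```

But again: if the corruption happened on the write path, the retry will fail with the same checksum error.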

To prevent it from happening you need to work out why the data was corrupted. If it was due to a faulty disk/drive controller/RAM/etc then throw the faulty component away. If it was that kernel bug, or some other bug, then upgrade the buggy software. Chasing this sort of thing down isn't really something we can help you with on these forums; it's more of a general sysadmin task.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.