Restore of elasticsearch data fails with CorruptIndexException[checksum failed (hardware problem?)

Hi Team,

We are trying to restore a backup, taken a couple of days ago, onto a freshly installed Elasticsearch cluster. The steps we followed are:

1. Installed an Elasticsearch chart and pushed some data.

2. Took a backup after a couple of days.

3. Deleted the chart, along with the volumes.

4. Re-installed the chart.

5. Tried to restore the backup tar that was generated in step 2.

When we try to restore the tar, the restore command fails because of a check in our post-restore script: it reads `.snapshot.shards.failed` from the restore API response, and if any shard failed, it fails the restore.
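For clarity, the check our post-restore script performs is roughly equivalent to this sketch (the helper name is ours, and it assumes the restore was called with `wait_for_completion=true` so the response body contains the `snapshot.shards` summary):

```python
import json

def restore_failed(response_body: str) -> bool:
    """Return True if the restore API response reports any failed shards.

    Assumes the restore was called with wait_for_completion=true, so the
    body looks like {"snapshot": {"shards": {"failed": N, ...}, ...}}.
    """
    body = json.loads(response_body)
    shards = body.get("snapshot", {}).get("shards", {})
    return shards.get("failed", 0) > 0

# Illustrative response bodies (shape as returned by the restore API;
# values are made up):
ok = '{"snapshot": {"snapshot": "snap-1", "shards": {"total": 5, "failed": 0, "successful": 5}}}'
bad = '{"snapshot": {"snapshot": "snap-1", "shards": {"total": 5, "failed": 1, "successful": 4}}}'
```

In our actual script this same check is done with `jq` on the curl output, but the logic is identical.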

We notice the following errors in the es-master pod's log:

{"type":"log","host":"elasticsearch-master-0","level":"WARN","systemid":"xx","system":"xxxxx","time": "2021-01-11T12:57:35.832Z","logger":"o.e.c.r.a.AllocationService","timezone":"UTC","marker":"[elasticsearch-master-0] ","log":

{"message":"failing shard [failed shard, shard [log-default-2021.01.06-restored_1610369773][0], node[rtI9WKPhQKC2kGYIYe7VVA], [P], recovery_source[snapshot recovery [eZWe2MSCQYioReM-353mZg] from es_backup:es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ], s[INITIALIZING], a[id=HBxXRVRjT9mrjSSxehGwiw], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2021-01-11T12:56:16.135Z], delayed=false, details[restore_source[es_backup/es-snapshot-2021.01.11-09:56:30]], allocation_status[deciders_throttled]], message [failed recovery], failure [RecoveryFailedException[[log-default-2021.01.06-restored_1610369773][0]: Recovery failed on {elasticsearch-data-1}{rtI9WKPhQKC2kGYIYe7VVA}{1DoYGQPcSHOXUmG7v3rE8Q}{172.30.177.183}{172.30.177.183:9300}{d}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))]; ], markAsStale [true]]"}}
org.elasticsearch.indices.recovery.RecoveryFailedException: [log-default-2021.01.06-restored_1610369773][0]: Recovery failed on {elasticsearch-data-1}{rtI9WKPhQKC2kGYIYe7VVA}{1DoYGQPcSHOXUmG7v3rE8Q}{172.30.177.183}{172.30.177.183:9300}{d}
at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:2644) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:362) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:484) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$59(BlobStoreRepository.java:1857) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:94) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:83) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$fileQueueListener$61(BlobStoreRepository.java:1941) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:94) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:683) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.8.0.jar:7.8.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
... 17 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
... 15 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot-2021.01.11-09:56:30/ZW4vNG0HRluaLk2uAjzEaQ]
... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1197) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1175) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1205) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$7.restoreFile(BlobStoreRepository.java:1911) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$7.lambda$restoreFiles$1(BlobStoreRepository.java:1883) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]

We are only noticing this issue with this particular backup tar, not with any of our other backups.
What could be the cause of this issue, and how can we overcome it?

Thanks in advance.

Best Regards,
Akshat

CorruptIndexException[checksum failed (hardware problem?) : expected=nfihfd actual=1lqjue8 (resource=name [_1e.cfs], length [76044609], checksum [nfihfd], writtenBy [8.5.1]) (resource=VerifyingIndexOutput(_1e.cfs))]

Your backup is corrupt, in the sense that its contents are not what Elasticsearch originally wrote. As the message suggests, this is likely a hardware problem: maybe the storage medium, maybe RAM, maybe something else entirely.

@DavidTurner Thank you for your quick response.

We ran the backup and the restore in the same lab. The storage class and everything else at restore time are the same as when we took the backup.
How can the backup content change? What could be the possible reasons behind that? Is there any workaround that could help us restore this data?

Also, could you please point me to any Elasticsearch documentation that covers this topic?

These things happen; there are lots of possibilities. Wikipedia has an article that gives some more background on silent corruption on disks; a user here recently reported corruption due to bad RAM; we recently found a kernel bug that causes corruption; and I've also seen corruption introduced by bad or buggy storage controllers.


@DavidTurner Thank you for the update.
Any idea how we can recover from this error once encountered, or how we can prevent it from happening?

There's not much you can do to recover this specific data. You could try restoring it again in the hope that the corruption happened when reading data rather than when writing it, but I wouldn't be very hopeful of success there. If the data that's written is wrong then the right data is gone.
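Concretely, retrying means deleting the partially-restored index left behind by the failed attempt and then calling the restore API again. A rough sketch of the two calls involved, using the repository, snapshot, and index names from the log above (the helper is illustrative, not an official client; `indices` and `include_global_state` are standard restore API parameters):

```python
import json

def build_retry_requests(repo: str, snapshot: str, index: str):
    """Build the two HTTP requests for a restore retry:
    1. DELETE the partially-restored index from the failed attempt.
    2. POST a fresh restore of just that index from the snapshot.
    Returns a list of (method, path, body) tuples usable with any HTTP client.
    """
    delete_req = ("DELETE", f"/{index}", None)
    restore_body = json.dumps({
        "indices": index,
        "include_global_state": False,
    })
    restore_req = (
        "POST",
        f"/_snapshot/{repo}/{snapshot}/_restore?wait_for_completion=true",
        restore_body,
    )
    return [delete_req, restore_req]

# Identifiers taken from the log output in this thread:
reqs = build_retry_requests(
    "es_backup",
    "es-snapshot-2021.01.11-09:56:30",
    "log-default-2021.01.06-restored_1610369773",
)
```

But again: if the corruption happened on the write path, the retry will fail with the same checksum error.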

To prevent it from happening you need to work out why the data was corrupted. If it was due to a faulty disk/drive controller/RAM/etc then throw the faulty component away. If it was that kernel bug, or some other bug, then upgrade the buggy software. Chasing this sort of thing down isn't really something we can help you with on these forums; it's more of a general sysadmin task.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.