Restore snapshot checksum problem (Troubleshooting corruption)

Hi, I'm encountering a problem when restoring an ES snapshot into an empty cluster: some of the indices can't be restored due to org.apache.lucene.index.CorruptIndexException (checksum failed). I'm using ES version 8.10.

I'm aware of the documentation on this issue (Troubleshooting corruption | Elasticsearch Guide [8.10] | Elastic), and a few people have encountered similar problems, which most often boil down to a hardware problem (RAM, bad sectors, etc.).

What I didn't fully understand is when exactly ES computes and stores a checksum. Does it happen:

  • when a segment is written
  • when segments are merged
  • when creating the snapshot (or does it simply copy the files from the data dir to the backup dir, including the checksum)?

Btw, in my case the data dir is on an ext4 file system and the backup dir is on ext3. Is this a sub-optimal setup, and could it be the reason for the corrupt index issue?

Thanks.


Using different file systems (ext4 for data and ext3 for backup) can contribute to corruption issues.

Not quite:

When creating the snapshot we verify the previously-created checksum, but otherwise just copy the checksum-carrying files verbatim to the snapshot.

The rest is right though: we create the checksum footer whenever creating a new file, which includes creating a fresh segment and merging segments together.
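To make the write-once, verify-later idea concrete, here is a rough Python sketch. It is not the real Lucene file format: the 4-byte CRC32 footer and the names write_with_footer and verify_footer are assumptions made up for illustration. It only shows that the checksum is computed when the file is created, travels with the file when it is copied verbatim, and is recomputed from the bytes when it is checked.

```python
import struct
import zlib

FOOTER_LEN = 4  # simplified assumption; the real footer layout differs


def write_with_footer(payload: bytes) -> bytes:
    """Return the payload followed by a big-endian CRC32 of the payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">I", crc)


def verify_footer(blob: bytes) -> bool:
    """Recompute the CRC32 over the body and compare it with the stored footer."""
    if len(blob) < FOOTER_LEN:
        return False
    body, footer = blob[:-FOOTER_LEN], blob[-FOOTER_LEN:]
    (stored,) = struct.unpack(">I", footer)
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored


# Checksum is computed once, when the "file" is created.
segment = write_with_footer(b"some segment bytes")

# A snapshot-style copy is verbatim, so the footer comes along unchanged.
snapshot_copy = bytes(segment)

# Flip a single bit somewhere in the body; the length stays the same.
corrupted = bytearray(snapshot_copy)
corrupted[3] ^= 0x01
```

Here verify_footer(snapshot_copy) passes and verify_footer(bytes(corrupted)) fails, even though both have the same length as the original.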

Having ext4 for data and ext3 for backups shouldn't cause this; ES uses no filesystem-specific features that could possibly matter here. A checksum error really just means that the data Elasticsearch reads from the repository is different from the data it originally wrote.
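If you happen to still have a known-good copy of an affected file, one way to see what "different" means is to compare it with the repository copy block by block. This is only a sketch: the helper name differing_blocks and the 4096-byte block size are assumptions for illustration, and nothing here is an Elasticsearch API.

```python
from typing import List


def differing_blocks(original: bytes, copy: bytes, block_size: int = 4096) -> List[int]:
    """Return the starting offsets of the blocks whose contents differ."""
    if len(original) != len(copy):
        raise ValueError("this sketch only handles equal-length inputs")
    offsets = []
    for start in range(0, len(original), block_size):
        end = start + block_size
        if original[start:end] != copy[start:end]:
            offsets.append(start)
    return offsets


# Hypothetical data: 10000 zero bytes, with one bit flipped in the copy.
original = bytes(10000)
damaged = bytearray(original)
damaged[5000] ^= 0x80

bad_offsets = differing_blocks(original, bytes(damaged))
```

In this made-up case the two inputs have identical lengths and only the block starting at offset 4096 differs, which is the kind of damage a length check misses and a checksum catches.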

Thanks for getting back to me.

When you say 'We verify the previously-created checksum', do you mean that before copying the files to the snapshot ES will re-calculate the checksum, and if it doesn't match the snapshot will fail?

Btw, I'm using a shared file system repository (an NFS-mounted dir) as the snapshot repository.

In case it sheds more light on the problem: in the log I see that 'writtenLength' and 'expectedLength' are the same, while in the footer=, verification failed (hardware problem?) part the expected and actual values are different.

Yes.