Hi, I'm encountering a problem when restoring an ES snapshot into an empty cluster: some of the indices can't be restored due to org.apache.lucene.index.CorruptIndexException: checksum failed (I'm using ES version 8.10).
What I don't fully understand is when exactly ES computes and stores a checksum. Does it happen:
- when a segment is written?
- when segments are merged?
- when the snapshot is created (or does it simply copy the files from the data dir to the backup dir, including the checksum)?
Btw, in my case the data dir is on an ext4 filesystem and the backup dir is on ext3. Is this a sub-optimal setup, and could it be the reason for the corrupt index issue?
We verify the previously-created checksum at snapshot time, but otherwise just copy the checksum-carrying files verbatim to the snapshot.
The rest is right, though: we create the checksum footer whenever we create a new file, which includes writing a fresh segment and merging segments together.
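If it helps to see what that footer looks like at the Lucene level, here is a minimal sketch (not Elasticsearch's actual code; the file name, codec name, and directory path are made up for illustration) that writes a file with CodecUtil.writeFooter and then verifies it with CodecUtil.checksumEntireFile, which is essentially the kind of check that runs against each file at snapshot time:

```java
import java.nio.file.Paths;

import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

public class ChecksumFooterDemo {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get("/tmp/checksum-demo"))) {
            // Every Lucene file gets a CRC-32 checksum footer appended when it is written.
            try (IndexOutput out = dir.createOutput("_demo.dat", IOContext.DEFAULT)) {
                CodecUtil.writeHeader(out, "DemoCodec", 0);
                out.writeString("some segment data");
                CodecUtil.writeFooter(out); // appends footer magic, algorithm id and CRC-32
            }

            // Re-reading the whole file and comparing against the value stored in the
            // footer is the same style of verification done before snapshotting the file.
            try (IndexInput in = dir.openInput("_demo.dat", IOContext.DEFAULT)) {
                try {
                    long checksum = CodecUtil.checksumEntireFile(in);
                    System.out.println("checksum OK: " + checksum);
                } catch (CorruptIndexException e) {
                    System.out.println("checksum failed: " + e.getMessage());
                }
            }
        }
    }
}
```

The "expected" value in a failure message is the CRC-32 that was stored in the footer when the file was written, and the "actual" value is recomputed from the bytes that were just read back.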
A different filesystem for the backup dir shouldn't matter; ES uses no filesystem-specific features that could possibly be relevant here. A checksum error really just means that the data Elasticsearch reads from the repository is different from the data it originally wrote.
When you say "We verify the previously-created checksum", do you mean that before copying the files to the snapshot ES re-calculates the checksum, and that if it doesn't match the snapshot will fail?
Btw, I'm using a shared file system repository (an NFS-mounted dir) as the snapshot repository.
In case it sheds more light on the problem: in the log I see that 'writtenLength' and 'expectedLength' are the same, but in the 'footer=' part of the 'verification failed (hardware problem?)' message the expected and actual checksums differ.
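A matching length but mismatching checksum suggests the bytes changed somewhere between the data dir and the repository (or on the way back over NFS). If you still have access to both copies of a failing file, one way to narrow it down is to checksum each copy independently and compare. This is just a generic sketch using java.util.zip.CRC32, not anything Elasticsearch ships; the two paths are placeholders you'd replace with a real segment file and its copy in the repository:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

public class CompareCopies {
    // Computes the CRC-32 of an entire file.
    static long crc32Of(Path path) throws IOException {
        CRC32 crc = new CRC32();
        try (InputStream in = new CheckedInputStream(Files.newInputStream(path), crc)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // reading through CheckedInputStream updates the CRC as a side effect
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths: point these at the same file in the data dir
        // and in the snapshot repository on the NFS mount.
        Path original = Path.of("/path/to/data-dir/segment-file");
        Path repoCopy = Path.of("/path/to/nfs-backup/copy-of-that-file");

        long a = crc32Of(original);
        long b = crc32Of(repoCopy);
        System.out.printf("original=%x repoCopy=%x %s%n", a, b, a == b ? "match" : "MISMATCH");
    }
}
```

If the two values differ, the corruption happened while writing to or reading from the repository; if they match, it's worth re-checking whether the file was already corrupt before the snapshot was taken.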