If an index segment that is referenced by snapshots becomes corrupt/missing from the snapshot repository, do you have to erase everything in the repository and start over?
Is there a way to resolve just that index's issues, without impacting the other indices in the snapshots?
Background:
Old segments for two indices in our S3 repository became corrupted/missing (no_such_file_exception errors). All subsequent snapshots of these indices to that repository failed until we eventually emptied the repository and started over.
Does the Snapshot & Restore process have any alternatives to wiping this repo and starting again when a snapshotted segment itself becomes corrupt?
It does not seem like future snapshots will "retake" a snapshot of corrupted/missing segments, and the guidance I've found here is typically to erase the repo and start over if you want to back up that index.
The only truly safe way to handle a broken repository is indeed to start again.
That said, I expect that in many cases of shard- or index-level repository corruption the repository should start working again if you delete all the snapshots that involve the broken index. You can keep hold of the other data in the repository by cloning each snapshot that involves the broken index, specifying `*,-broken_index` to exclude just the broken index from the clone. Then delete all the bad snapshots; see the sketch below.
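For illustration, here's a rough sketch of that clone-and-delete process using Python's `requests` library against the clone-snapshot API. The repository name, snapshot names, and index name are placeholders for your own, and the list of affected snapshots is assumed to be known in advance:

```python
import requests

ES = "http://localhost:9200"
REPO = "my_repo"                       # placeholder repository name
BROKEN_INDEX = "broken_index"          # placeholder for the corrupted index
AFFECTED_SNAPSHOTS = [                 # every snapshot that includes the broken index
    "snap-2024-05-01",
    "snap-2024-05-02",
]

# 1. Clone each affected snapshot, keeping every index except the broken one.
for snap in AFFECTED_SNAPSHOTS:
    resp = requests.put(
        f"{ES}/_snapshot/{REPO}/{snap}/_clone/{snap}-cleaned",
        json={"indices": f"*,-{BROKEN_INDEX}"},
    )
    resp.raise_for_status()

# 2. Once the clones exist, delete the original (bad) snapshots.
for snap in AFFECTED_SNAPSHOTS:
    resp = requests.delete(f"{ES}/_snapshot/{REPO}/{snap}")
    resp.raise_for_status()
```

The ordering matters: only delete the originals after the clones have completed, so the non-broken data is always referenced by at least one snapshot.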
From 8.16.0 onwards Elasticsearch will attempt to repair the repository contents if the missing data can be reconstructed from other blobs in the repository. That release also adds an API to verify the integrity of the repository, so you can proactively look for problems like these and check that the repository contents are valid after the clone-and-delete process suggested above.
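A minimal sketch of calling that verification API (8.16.0+), again with a placeholder repository name; the response reports any anomalies found in the repository contents:

```python
import requests

ES = "http://localhost:9200"
REPO = "my_repo"  # placeholder repository name

# Verify repository integrity (POST /_snapshot/<repository>/_verify_integrity).
resp = requests.post(f"{ES}/_snapshot/{REPO}/_verify_integrity")
resp.raise_for_status()
print(resp.json())
```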