Hi Elastic Team,
I am facing a repository corruption issue on Elasticsearch 8.5.3 and would appreciate guidance on how to perform a "surgical" repair of our snapshot metadata.
The Situation:
Cluster Scale: Our live cluster holds approximately 6.5 TB of data.
Metadata Scale: The repository metadata is extremely large: listing the snapshots returns ~1.3 million lines of JSON.
The Cause: An S3 Lifecycle Policy was active on the bucket and deleted older objects that the repository still references.
Why a fresh repository is not a viable option for us:
Storage Impact: If we create a new repo, the first full snapshot would require uploading 6.5 TB to S3, which is a massive operation in terms of time and cost.
Data Retention Gap: Our cluster policy deletes live indices older than 90 days once they are backed up. If we start a fresh repo, we lose all data older than 90 days because that historical data only exists in this current, partially corrupted repository.
Restore Risk: We understand that while backups are incremental, restores are not. If we try to restore snapshots older than 90 days to "re-seed" a new repo, we are afraid the restores will fail because the blobs or metadata those snapshots reference may already be missing from S3. Restoring would also require massive disk space that we would rather not use.
How we noticed the issue:
Our daily snapshot policy began failing with this INTERNAL_SERVER_ERROR:
NoSuchFileException[Blob object [logs02-02/indices/PkoCOa-6T5S-jdgA5PREXA/0/index-rzgmdcDFQY2S5hZLn1aPBQ] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey)]
Specific missing snapshot error:
{
  "type": "snapshot_missing_exception",
  "reason": "[bo-elk-backup:logs02-02-2023.07.01-gpq2m0k1sf-mqo_e9ghfbw/efJOWaAuRtSsx--S4kksaQ] is missing",
  "caused_by": {
    "type": "no_such_file_exception",
    "reason": "Blob object [logs02-02/snap-efJOWaAuRtSsx--S4kksaQ.dat] not found"
  }
}
What we have verified:
- Verification: POST /_snapshot/bo-elk-backup/_verify passes successfully.
- S3 Config: Only one cluster has write access to this repository.
- Lifecycle: We have now disabled the S3 Lifecycle policy to prevent further loss.
Our Questions:
- Is there a way to prune references to these missing blobs to restore repository health without a full re-upload?
- Does the _cleanup API in 8.5.3 effectively rewrite the index-N files if we manually delete the affected snapshots? Manually deleting snapshots also seems tricky, since we do not know how many of them are missing data.
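On the question of how many snapshots are affected, here is a minimal, read-only sketch of how the damage could be sized before deleting anything: compare the blob keys that the root index-N file references against a listing of the keys still present in the bucket. The field names (`snapshots`, `indices`, `id`, `shard_generations`), the `_new`/`_deleted` placeholder values, and the base path `logs02-02` are assumptions inferred from the blob names in the errors above and may not match the 8.5.3 format exactly. The function names and any values marked hypothetical are illustrative only. It covers only the per-snapshot `snap-*.dat` blobs and the latest shard-level `index-*` blobs, not segment data.

```python
# Read-only sketch: which blobs does the repository index reference
# that are no longer present in the bucket listing?

# Assumed placeholder generation values that do not correspond to real blobs.
SPECIAL_GENERATIONS = {"_new", "_deleted"}


def expected_blobs(repo_data, base_path=""):
    """Map each expected blob key to the set of snapshot names that use it."""
    base = base_path.strip("/")
    prefix = f"{base}/" if base else ""
    names = {s["uuid"]: s["name"] for s in repo_data.get("snapshots", [])}

    expected = {}
    # One snap-<uuid>.dat per snapshot, as in the snapshot_missing_exception.
    for uuid, name in names.items():
        expected.setdefault(f"{prefix}snap-{uuid}.dat", set()).add(name)

    # One indices/<id>/<shard>/index-<generation> per shard, as in the 404 error.
    for index in repo_data.get("indices", {}).values():
        users = {names.get(u, u) for u in index.get("snapshots", [])}
        for shard, gen in enumerate(index.get("shard_generations", [])):
            if not gen or gen in SPECIAL_GENERATIONS:
                continue
            key = f"{prefix}indices/{index['id']}/{shard}/index-{gen}"
            expected.setdefault(key, set()).update(users)
    return expected


def find_missing(repo_data, existing_keys, base_path=""):
    """Return {missing_key: [snapshot names referencing it]}, sorted by key."""
    existing = set(existing_keys)
    return {
        key: sorted(users)
        for key, users in sorted(expected_blobs(repo_data, base_path).items())
        if key not in existing
    }


def affected_snapshots(missing):
    """Flatten the result of find_missing into a sorted list of snapshot names."""
    return sorted({name for users in missing.values() for name in users})


# Sample input: IDs taken from the errors above; names marked hypothetical are made up.
sample = {
    "snapshots": [
        {"name": "logs02-02-2023.07.01-gpq2m0k1sf-mqo_e9ghfbw",
         "uuid": "efJOWaAuRtSsx--S4kksaQ"},
        {"name": "hypothetical-newer-snapshot", "uuid": "hypothetical-uuid-2"},
    ],
    "indices": {
        "hypothetical-index": {
            "id": "PkoCOa-6T5S-jdgA5PREXA",
            "snapshots": ["efJOWaAuRtSsx--S4kksaQ"],
            "shard_generations": ["rzgmdcDFQY2S5hZLn1aPBQ"],
        }
    },
}
listing = ["logs02-02/snap-hypothetical-uuid-2.dat"]  # keys still in the bucket

missing = find_missing(sample, listing, base_path="logs02-02")
for key, users in missing.items():
    print(key, "<-", ", ".join(users))
print("affected snapshots:", affected_snapshots(missing))
```

In practice `repo_data` would be the parsed root index-N file and `existing_keys` a full listing of the bucket prefix; neither step is shown here, and nothing in the sketch writes to the repository.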
I am happy to provide full server-side logs if needed to help diagnose the exact point of failure during metadata parsing.
Thank you so much in advance for any suggestions.