GCS Snapshot restoration not working

Hi!

I inherited an Elastic cluster that was set up before my time. It is currently taking a daily snapshot into GCS successfully, as far as I can see when browsing the GCS bucket. I see no error logs and have no reason to believe the snapshots are incomplete. I wanted to restore a subset of indices and am running into an error:

{
  "error": {
    "root_cause": [
      {
        "type": "snapshot_missing_exception",
        "reason": "[production-gcs:daily-snapshots-2024.03.11-zrq9hu6btggfblt_2gn31w/7Bc9dEM5QFKdE2foHtxeew] is missing"
      }
    ],
    "type": "snapshot_missing_exception",
    "reason": "[production-gcs:daily-snapshots-2024.03.11-zrq9hu6btggfblt_2gn31w/7Bc9dEM5QFKdE2foHtxeew] is missing",
    "caused_by": {
      "type": "no_such_file_exception",
      "reason": """Blob object [indices/VAGKEGblS9KCZcH7tO0oXA/meta-iRylTYYB1eWKxpYBa6hZ.dat] not found: 404 Not Found
GET https://storage.googleapis.com/download/storage/v1/b/my-bucket/o/indices%2FVAGKEGblS9KCZcH7tO0oXA%2Fmeta-iRylTYYB1eWKxpYBa6hZ.dat?alt=media
No such object: my-bucket/indices/VAGKEGblS9KCZcH7tO0oXA/meta-iRylTYYB1eWKxpYBa6hZ.dat"""
    }
  },
  "status": 404
}

I can attest my-bucket/indices/VAGKEGblS9KCZcH7tO0oXA/meta-iRylTYYB1eWKxpYBa6hZ.dat doesn't exist.

gsutil stat gs://my-bucket/indices/VAGKEGblS9KCZcH7tO0oXA/meta-iRylTYYB1eWKxpYBa6hZ.dat
No URLs matched: gs://my-bucket/indices/VAGKEGblS9KCZcH7tO0oXA/meta-iRylTYYB1eWKxpYBa6hZ.dat

I found related posts about this, but none with a resolution or pointers to docs. I also see warnings about other missing metadata files in my logs, like this:

{"@timestamp":"2024-03-21T18:46:49.457Z", "log.level": "WARN", "message":"failed to fetch snapshot info", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[soc-elastic-server-es-master-1][snapshot_meta][T#9228]","log.logger":"org.elasticsearch.repositories.GetSnapshotInfoContext","trace.id":"f996d44932634a439388bd71c205a49d","elasticsearch.cluster.uuid":"ujdWpZmZS4ebNP1-RlkVFw","elasticsearch.node.id":"BV2mY-T7TV2gN1x5PW5HKQ","elasticsearch.node.name":"soc-elastic-server-es-master-1","elasticsearch.cluster.name":"soc-elastic-server","error.type":"org.elasticsearch.snapshots.SnapshotMissingException","error.message":"[production-gcs:daily-snapshots-2022.11.02-61ori2c-teeonnmqxcmjyg/aklfqoP-QTWpdn3YeLWr4g] is missing","error.stack_trace":"org.elasticsearch.snapshots.SnapshotMissingException: [production-gcs:daily-snapshots-2022.11.02-61ori2c-teeonnmqxcmjyg/aklfqoP-QTWpdn3YeLWr4g] is missing\n\tat org.elasticsearch.server@8.6.1/org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$getOneSnapshotInfo$52(BlobStoreRepository.java:1557)\n\tat org.elasticsearch.server@8.6.1/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:850)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1589)\nCaused by: java.nio.file.NoSuchFileException: Blob object [snap-aklfqoP-QTWpdn3YeLWr4g.dat] not found: 404 Not Found\nGET https://storage.googleapis.com/download/storage/v1/b/my-bucket/o/snap-aklfqoP-QTWpdn3YeLWr4g.dat?alt=media\nNo such object: my-bucket/snap-aklfqoP-QTWpdn3YeLWr4g.dat\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageRetryingInputStream.openStream(GoogleCloudStorageRetryingInputStream.java:132)\n\tat 
org.elasticsearch.repositories.gcs.GoogleCloudStorageRetryingInputStream.<init>(GoogleCloudStorageRetryingInputStream.java:84)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageRetryingInputStream.<init>(GoogleCloudStorageRetryingInputStream.java:66)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.readBlob(GoogleCloudStorageBlobStore.java:209)\n\tat org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.readBlob(GoogleCloudStorageBlobContainer.java:63)\n\tat org.elasticsearch.server@8.6.1/org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:108)\n\tat org.elasticsearch.server@8.6.1/org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$getOneSnapshotInfo$52(BlobStoreRepository.java:1555)\n\t... 4 more\n"}

This makes me think we are not getting proper feedback on successful snapshotting, or even that we cannot recover the missing ones. Any help is appreciated.

I set up another bucket, my-bucket-2, created a new snapshot repository, did a manual snapshot and I was able to restore some features and indices from that one. Is it possible for a snapshot repo to become corrupted? That worries me a little with using snapshots as a safety and continuity mechanism. Not sure what I can do to restore the old data. I can live with some data loss, I cannot live with a broken backup mechanism.

I'm not aware of any known issues that could explain this in 8.6.1. Do you happen to have any snapshots from a very old (7.12.x or earlier) cluster in this repository? Not that this explains anything in itself, but it's a different code path and might give a hint where to look.

Also does anything else have access to this bucket to delete blobs? Are you using Object Lifecycle Management to expire old data?

I do not have backups from < 8.x.x, we deployed our elastic stack on the same major version.

I am confident nobody else has access to the bucket storage. We use a service account to grant ELK Storage Object Admin, and nobody except me has read/write access otherwise. We do use OLM to delete old objects, but this is set for 365+ days. We change the tier after 90+ days, but as far as I know this only changes the performance. The snapshots I'm testing are a week old.

As I understand it, the snapshot mechanism seems incremental. If that's the case, how far apart are full baselines created? If the last baseline was > 1 year old, could this explain why I cannot restore some features?

Ah right, the OLM deletion rule would explain it. See these docs:

Don’t modify anything within the repository or run processes that might interfere with its contents. If something other than Elasticsearch modifies the contents of the repository then future snapshot or restore operations may fail, reporting corruption or other data inconsistencies, or may appear to succeed having silently lost some of your data.

As for snapshots being incremental on top of periodic full baselines: no, that's not correct, all snapshots are (logically) independent. However, there is some deduplication to avoid unnecessary uploads, which means that recent snapshots may depend on blobs that are much older. In this case, some of those blobs were more than a year old, so OLM has deleted them.
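To make that dependency concrete, here is a toy Python sketch. It is not Elasticsearch's real repository layout or code; the class `ToyRepository`, the blob names and the day numbers are all made up for illustration. It only shows how deduplication plus an age-based delete rule can break a snapshot that is itself just a week old.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepository:
    """Toy model of a deduplicating snapshot repository (hypothetical, not the real layout)."""
    blobs: dict = field(default_factory=dict)      # blob name -> day it was uploaded
    snapshots: dict = field(default_factory=dict)  # snapshot name -> set of blob names

    def take_snapshot(self, name, segments, day):
        # Deduplication: only blobs not already in the repository are uploaded,
        # so an unchanged blob keeps its original upload day.
        for seg in segments:
            self.blobs.setdefault(seg, day)
        self.snapshots[name] = set(segments)

    def expire_older_than(self, max_age_days, today):
        # An age-based lifecycle rule looks at blob age only; it has no idea
        # which snapshots still reference the blob.
        for blob, uploaded in list(self.blobs.items()):
            if today - uploaded >= max_age_days:
                del self.blobs[blob]

    def missing_blobs(self, name):
        # Blobs the snapshot needs that are no longer in the repository.
        return sorted(self.snapshots[name] - set(self.blobs))


repo = ToyRepository()
repo.take_snapshot("day-0", ["seg-a", "seg-b"], day=0)
repo.take_snapshot("day-400", ["seg-a", "seg-c"], day=400)  # reuses seg-a from day 0
repo.expire_older_than(365, today=407)
print(repo.missing_blobs("day-400"))  # ['seg-a']
```

The "day-400" snapshot is only seven days old when the rule runs, yet it still needs `seg-a`, which was uploaded on day 0 and has just been expired.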

Ahhh shucks. Well lesson learned the hard way.

Thankfully, I set up a new bucket without the OLM policy and was successful in making a snapshot and restoring it. This is our way forward I guess.

Huge thanks for clarifying some of my assumptions, really glad I'm getting closure on this.

Yep, sorry to say there's no way to repair a repository that's had this kind of damage. You may find it's possible to restore data for some indices from it, but not reliably, so starting again with a fresh repository is the best path forwards.
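If it helps when setting up the fresh bucket, one way to double-check that no lifecycle rule will delete repository blobs is to scan the bucket's lifecycle configuration for `Delete` actions. The sketch below is only an illustration: it assumes the JSON shape that `gsutil lifecycle get gs://<bucket>` prints, the helper name `delete_rules` is made up, and the example rules (including the `NEARLINE` storage class) are invented to resemble the policy described earlier in the thread.

```python
import json


def delete_rules(lifecycle_config):
    """Return the lifecycle rules whose action deletes objects."""
    # Accept both {"rule": [...]} and {"lifecycle": {"rule": [...]}}.
    cfg = lifecycle_config.get("lifecycle", lifecycle_config)
    return [
        rule for rule in cfg.get("rule", [])
        if rule.get("action", {}).get("type") == "Delete"
    ]


# Hypothetical policy: change tier after 90 days, delete after 365 days.
example = json.loads("""
{"rule": [
  {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
   "condition": {"age": 90}},
  {"action": {"type": "Delete"}, "condition": {"age": 365}}
]}
""")

for rule in delete_rules(example):
    print("deletes objects when:", rule["condition"])
```

An empty result for the new bucket means nothing other than Elasticsearch should be expiring blobs from the repository, at least via lifecycle rules.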
