Snapshot repository is 8x larger than total cluster data!?!?!

My cluster's total usage as reported by Kibana (kibana/app/monitoring#/overview) is 1TB (102 indices, each with one replica). Snapshots go to Backblaze (registered as the default S3 client), which reports 8TB stored.
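For context, the repo is registered roughly like this; the bucket name and endpoint below are placeholders, not my real values, and the credentials live on the `default` S3 client via the Elasticsearch keystore:

```
# In elasticsearch.yml on every node, the default S3 client points at Backblaze:
# s3.client.default.endpoint: s3.us-west-002.backblazeb2.com
# (access/secret keys go in the keystore as s3.client.default.access_key / secret_key)

# Register the repository against that client (bucket name is a placeholder)
PUT _snapshot/my_backblaze_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "client": "default"
  }
}
```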

I was snapshotting all indices hourly, and just yesterday I modified the daily job to snapshot only the few indices I care about, a few times a day, with week-long retention.

The data doesn't change very often, so I think a once-a-week snapshot will mostly be fine, but when it does change it basically rewrites the entire dataset (index operations, not updates).

Should I:

  1. Use a single policy to snap hourly and keep it for 45d or so?
  2. Use multiple policies in the same repo with different schedules to snap a few times a day and also once a week?
  3. Use multiple policies (one per index) so any single index can be restored without having to restore all of them?
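For what it's worth on option 3: as far as I understand it, a single index can already be restored out of a multi-index snapshot by naming it in the restore request, so per-index policies wouldn't be needed just for that. A sketch with placeholder repo, snapshot, and index names, renaming on restore so it doesn't clash with the live index:

```
POST _snapshot/my_backblaze_repo/daily-snap-2024.01.15/_restore
{
  "indices": "inventory",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}
```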

Based on how the data changes, I think we would basically be restoring to a particular week's or month's dataset rather than an hourly moment in time. Think inventory that gets counted at the end of the month but is only infrequently updated mid-month, so weekly already feels like belt and suspenders.

I now have two policies:

  1. Daily: `0 30 1 * * ?`, with 7d retention and a max count of 100
  2. Weekly: `0 30 2 ? * 4`, with 45d retention and no max count
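Spelled out as SLM policies, they look roughly like this (policy names, repo name, and index patterns are placeholders; the schedules and retention match what's listed above):

```
PUT _slm/policy/daily-important
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-{now/d}>",
  "repository": "my_backblaze_repo",
  "config": { "indices": ["important-*"] },
  "retention": { "expire_after": "7d", "max_count": 100 }
}

PUT _slm/policy/weekly-important
{
  "schedule": "0 30 2 ? * 4",
  "name": "<weekly-{now/d}>",
  "repository": "my_backblaze_repo",
  "config": { "indices": ["important-*"] },
  "retention": { "expire_after": "45d" }
}
```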

With the above changes I hoped the extra data would age out once the snapshot retention window passed the last big indexing job... but I'm still seeing 8TB in the repo, even though the oldest snapshot is from after that last ingest.
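For reference, this is how I can list what's actually in the repo and force retention to run right away, since SLM retention runs on its own schedule rather than at snapshot time (repo name is a placeholder):

```
# List every snapshot in the repo, oldest first
GET _cat/snapshots/my_backblaze_repo?v&s=start_epoch

# Kick off SLM retention immediately instead of waiting for its schedule
POST _slm/_execute_retention
```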

Can anyone help me pick the right path forward?

Well I struck out on my own and nuked the data! With my uplink speed it took about 3 days to complete the new snapshots.

I first went into the storage provider's web UI and deleted all the files.

I then either tried a new snapshot or deleted snapshots via the Elasticsearch API (forgive me for not remembering which) and was met with an error saying the repo might be corrupt because the expected data wasn't there.
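Whichever it was, both are one-liners (policy, repo, and snapshot names are placeholders again):

```
# Manually run an SLM policy (the "tried a new snap" case)
POST _slm/policy/daily-important/_execute

# Or delete a specific snapshot through Elasticsearch
DELETE _snapshot/my_backblaze_repo/daily-snap-2024.01.15
```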

In Kibana I tested the repo using the "Verify repository" button, which passed just fine. The error message linked to an article that basically said "you need to delete the repo and start over."
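As I understand it, the Kibana button just calls the verify API, which only checks that every node can reach and write to the repository location; it doesn't validate the snapshot contents, which is presumably why it still passed:

```
POST _snapshot/my_backblaze_repo/_verify
```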

I deleted the config for the policy and the repo, then added them both back.
I did this by first doing a GET to capture each config, then a DELETE, followed by a POST of the previously captured config.
This worked exactly as planned, and I was able to complete the new snapshot (which is only 500GB instead of 9TB!).
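For anyone following along, the sequence was roughly this (repo and policy names are placeholders; the sketch uses PUT, the documented verb for recreating both, even though I remembered it as a POST):

```
# Save the existing configs
GET _slm/policy/daily-important
GET _snapshot/my_backblaze_repo

# Remove the policy, then the repository
DELETE _slm/policy/daily-important
DELETE _snapshot/my_backblaze_repo

# Recreate the repository, then the policy, from the saved JSON
PUT _snapshot/my_backblaze_repo
{
  "type": "s3",
  "settings": { "bucket": "my-snapshot-bucket", "client": "default" }
}

PUT _slm/policy/daily-important
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-{now/d}>",
  "repository": "my_backblaze_repo",
  "config": { "indices": ["important-*"] },
  "retention": { "expire_after": "7d", "max_count": 100 }
}
```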

So I'm gonna call this solved.