Issues with snapshots

I am struggling to understand how one is expected to use the snapshot system to provide a reliable backup.

I understand that individual snapshots are incremental, but presumably only within repositories?

I have set up 3 repositories: daily, monthly and yearly.

  • daily: retention period of 31 days, runs once a day
  • monthly: retention period of 366 days, runs on the first of the month
  • yearly: no expiry, runs on Jan 1st
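
For reference, the daily one is an SLM policy roughly like this (the exact schedule and naming below are illustrative, not copied verbatim from my cluster):

PUT _slm/policy/daily
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "daily",
  "config": { "include_global_state": true },
  "retention": { "expire_after": "31d" }
}

monthly and yearly have the same shape, with "expire_after": "366d" on a first-of-the-month schedule and no retention block on a Jan 1st schedule respectively.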

My ES cluster has about 7 TB of disk, and the disk holding the repositories has several times that.

A few days ago the backup disk filled up and snapshots started failing.

Looking at the list of snapshots in Kibana, I have daily snapshots for the last 4 weeks (approx.), 9 monthly backups (because the policy was only fixed 9 months ago) and two yearly backups...

First question: Is the way I set up the policies sane? Or have I misunderstood something fundamental?

Second question: why are the backups using so much disk if they are incremental?

22T	/data/esbackups/daily
24T	/data/esbackups/monthly
4.5T	/data/esbackups/yearly

Lastly, is there a way of seeing how much disk space a particular snapshot is using?

This is not correct. Each Elasticsearch snapshot is a full snapshot.

Each shard in Elasticsearch is a Lucene index, which is built from immutable segments that are merged into new immutable segments as data is added, deleted or modified. The incremental aspect of snapshots is that segments that have already been snapshotted and still exist are reused in subsequent snapshots rather than copied into the repository repeatedly. The repository keeps track of which segments are in use by which snapshots and only deletes a segment once it is no longer in use by any snapshot.

Once indices no longer receive data or updates and effectively become read-only, their segments will no longer change and will not add to the repository when new snapshots are taken. If, however, you have indices that are continuously changing, new segments will continuously be created and the repository will likely hold the same data in multiple different segments, therefore taking up more space.
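
You can actually see this reuse via the snapshot status API: if you take two snapshots of an index that has not changed in between, the second one copies almost nothing new. A sketch (the repository and index names here are placeholders):

# first snapshot copies all the segments of the index
PUT _snapshot/daily/snap-1?wait_for_completion=true
{ "indices": "my-static-index" }

# second snapshot of the unchanged index reuses those segments
PUT _snapshot/daily/snap-2?wait_for_completion=true
{ "indices": "my-static-index" }

# inspect what each snapshot actually copied
GET _snapshot/daily/snap-2/_status

In the _status response, stats.total covers all the files the snapshot references while stats.incremental covers only the files that had to be newly copied into the repository, so for snap-2 the incremental size should be close to zero.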

This blog post is very, very old, but I believe it still explains the basic principles (still largely valid) quite well.

The best way to handle snapshots therefore depends on your use case. Can you share some details about your use case and what your requirements around snapshots are?

Thanks, I had completely misinterpreted what was meant by incremental...

My use case is pretty straightforward. Actually a poor man's (university) SIEM...

I have one data stream which we want to keep long term: 7 years.
There are also a couple of Beats feeds which have shorter retention times (1-3 months).
Lastly there is a very large feed (from Arkime, a full packet capture app) where indices are rolled over every day and kept for 7 days.

Backups are taken mostly as disaster protection, but there is also the thought that it would be good to be able to restore indices from backup after they have been deleted by the ILM process. In at least 5 years of operation we have never needed to do that.

The only data we really care about is the single datastream.

If I could, I would be happy to exclude the Arkime session indices: ~500 GB/day!

What is the retention period for the data stream in the cluster?

Given that you can select index patterns to include in a snapshot, I would probably suggest aligning the different snapshots with the retention periods.

  • Create one repository that covers all the indices. Take a snapshot there every day and keep it for a bit over a week or so. This will allow you to restore all the current indices if you suffer a failure.
  • Create another repository whose policy excludes the large Arkime indices (see the sketch after this list). As the last week of all indices is covered by the first repository, you can probably take this snapshot once a week and keep it for 3 months or so.
  • Create a third repository that only includes the data stream you want to retain for 7 years. As the other snapshots cover the last few months, you can probably take this once a month and retain it for the full 7 years.
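
As a sketch, the second policy could look something like this (the repository name, schedule and index pattern here are guesses; adjust the exclusion to match your actual Arkime index naming):

PUT _slm/policy/weekly-snapshots
{
  "schedule": "0 30 2 ? * MON",
  "name": "<weekly-snap-{now/d}>",
  "repository": "weekly",
  "config": {
    "indices": ["*", "-arkime-sessions*"]
  },
  "retention": { "expire_after": "90d" }
}

The leading * includes everything and the - prefix excludes the matching indices; the third policy would instead list only the long-term data stream under indices.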

I think we removed the word "incremental" from the docs for this reason. We now say they are "deduplicated". See e.g. these docs:

  • Snapshots are automatically deduplicated. You can take frequent snapshots with little impact to your storage overhead.
  • Each snapshot is logically independent. You can delete a snapshot without affecting other snapshots.

This is a tricky question to answer (because of the deduplication). #56660 might be what you're after, but it's not implemented yet.
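
To illustrate why (numbers invented): if snap-1 and snap-2 both reference the same 10 GB of shared segments and each also references 1 GB of segments unique to it, then each snapshot "uses" 11 GB by the total measure, yet deleting either one frees only 1 GB, and the shared 10 GB is freed only once both are gone. There is no single number that is obviously "the" size of a snapshot.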

Thanks David! Pleased that the docs have been improved!
