Backup & retention strategy

Hello,

We are new to the Elastic Stack and are trying to design our backup and data retention strategy. Our use case is logs, so we have the standard index-per-day setup.

Before delving into Elasticsearch internals, our business requirements are: recent logs should be fast to search, older logs up to some age should be backed up somewhere, and everything should be backed up and restorable in case of a fully catastrophic cluster failure.

After reading the relevant documentation, we have come up with the following plan for using Elasticsearch features to cover all our requirements:

There are 4 stages our logs enter depending on their age:

  • Warm
      ◦ fast searchable
      ◦ high CPU cluster nodes
  • Cold
      ◦ slow(er) searchable
      ◦ low CPU cluster nodes
      ◦ data compressed
      ◦ data forcemerged
  • Warm backup
      ◦ fast restore
      ◦ indexes closed but still on disk
  • Cold backup
      ◦ slow(er) restore
      ◦ indexes deleted from disk
      ◦ only available on snapshots

To ensure we have all the data backed up while our long-term backups take up as little space as possible, we're going to use two snapshot repositories:

  • Long term
      ◦ Snapshot indices daily after they have been moved to the Cold stage, so that they have already been compressed and optimised
  • Short term
      ◦ Snapshot indices in the Warm stage as often as possible
      ◦ Keep snapshots at an increasing back-off interval and delete the rest (e.g. if the snapshot runs every minute, keep all minute snapshots of the last 20 minutes and one snapshot per hour for the last 24 hours)
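As a rough illustration of the short-term back-off rule, here is a minimal Python sketch that decides which snapshots to keep from their timestamps alone. The function name `snapshots_to_keep`, the two window parameters, and the choice to keep the earliest snapshot in each clock hour are assumptions made for the example; it does not talk to Elasticsearch or Curator.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(timestamps, now,
                      minute_window=timedelta(minutes=20),
                      hourly_window=timedelta(hours=24)):
    """Return the set of snapshot timestamps to keep (hypothetical helper).

    - Everything taken within `minute_window` of `now` is kept.
    - Between `minute_window` and `hourly_window`, only the earliest
      snapshot of each clock hour is kept.
    - Anything older than `hourly_window` is dropped.
    """
    keep = set()
    hours_covered = set()
    for ts in sorted(timestamps):  # oldest first
        age = now - ts
        if age <= minute_window:
            keep.add(ts)
        elif age <= hourly_window:
            hour = ts.replace(minute=0, second=0, microsecond=0)
            if hour not in hours_covered:
                hours_covered.add(hour)
                keep.add(ts)
    return keep
```

Any snapshot whose timestamp is not in the returned set would be a candidate for deletion from the short-term repository.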

So to implement this strategy the following steps are run daily:

  • Delete indices older than Warm backup max age
  • Close indices older than Cold max age
  • Reallocate indices older than Warm max age to cold nodes
  • Forcemerge indices older than Warm max age
  • Snapshot indices older than Warm max age
  • Delete snapshots older than Cold backup max age
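To make the age thresholds in the daily steps concrete, the following Python sketch maps a daily index name to the steps that would apply to it on a given day. Everything specific here is an assumption for illustration: the `logs-YYYY.MM.DD` naming, the threshold values, and the action labels. It also simplifies by reporting only the steps for the oldest threshold an index has crossed, on the assumption that earlier steps were already applied on previous runs. Deleting old snapshots is not covered, since that depends on snapshot age rather than index age.

```python
from datetime import date, datetime, timedelta

# Hypothetical thresholds; pick values that match your own requirements.
WARM_MAX_AGE = timedelta(days=7)          # older than this: move to Cold
COLD_MAX_AGE = timedelta(days=30)         # older than this: close (Warm backup)
WARM_BACKUP_MAX_AGE = timedelta(days=90)  # older than this: delete (Cold backup)


def index_date(name, prefix="logs-"):
    """Parse the day from an index name such as 'logs-2024.04.09'."""
    if not name.startswith(prefix):
        raise ValueError(f"unexpected index name: {name}")
    return datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()


def daily_actions(name, today):
    """Return the daily steps that apply to one index, oldest threshold first."""
    age = today - index_date(name)
    if age > WARM_BACKUP_MAX_AGE:
        return ["delete_index"]
    if age > COLD_MAX_AGE:
        return ["close"]
    if age > WARM_MAX_AGE:
        return ["allocate_to_cold", "forcemerge", "snapshot_long_term"]
    return []  # still Warm: only the frequent short-term snapshots apply


if __name__ == "__main__":
    today = date(2024, 4, 10)
    for name in ["logs-2024.04.09", "logs-2024.03.25",
                 "logs-2024.02.20", "logs-2024.01.01"]:
        print(name, daily_actions(name, today))
```

In a Curator setup each of these labels would correspond to its own action with an age filter; the sketch only shows how the thresholds partition the indices.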

And the following steps are run as often as possible:

  • Snapshot warm indices
  • Delete unneeded snapshots as explained above in Short term

For the implementation, we plan on using Curator, since it seems to be the best way of keeping this kind of configuration in a human-readable format.

So... does this whole approach make sense? Have we missed any steps in the implementation? Is there some better or simpler way to do this?

Note that we have been largely inspired by this GitHub comment; it was a great read.

This approach makes sense to me. It's similar to what I usually recommend, depending on the use case and requirements of course.