Questions about backup strategy

Hi folks,

In my last project, backups of the indices were not very important. If the data was gone, it was gone, so I only took a snapshot of the Kibana index once a day.

For my new project I need to create a backup plan. This time, not losing data is more important.

Let's say we have a retention time of 30 days for accessible data in ES, for timestamp-based indices. Additionally we have some entity-centric indices, which are not rotated.

As an example requirement, let's say I need a backup of the system and need to be able to restore the cluster state to any point within the last 7 days. What is the right approach?

My current understanding is as follows; please give hints, correct me, or point out pitfalls:

  • a snapshot is configured to back up all indices
    • is it best practice to snapshot all indices, or to exclude the .monitoring-* indices?
  • since snapshots are designed to be lightweight, let's say we run them once an hour
    • is there a best practice for the maximum frequency?
    • how lightweight are they? Will indexing or query performance drop significantly while a snapshot is running?
  • snapshots are automatically deleted after 37 days (30-day retention time + 7-day restore window)
    • the documentation says: "When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots" -> what happens if a snapshot contains entity-centric indices like the Kibana index? What happens to the snapshot?
      • will it be merged with the following snapshots?
      • will it be deleted, or will it stay forever?
      • if it stays, and the snapshot contains data from both an entity-centric index and timestamp-rotated indices, will the data of the rotated indices be deleted from the snapshot that is marked for deletion, to make it smaller?
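The schedule and retention described above could be expressed with a snapshot lifecycle management (SLM) policy. This is only a sketch: it assumes a recent Elasticsearch version where SLM is available (7.4+), and the repository name `my_backup`, the policy name, and the snapshot name pattern are all made up. The exclusion of `.monitoring-*` is shown in case you decide to leave those out; on older versions, a cron job with Curator would play the same role.

```
PUT _slm/policy/hourly-snapshots
{
  "schedule": "0 0 * * * ?",
  "name": "<hourly-snap-{now/d}>",
  "repository": "my_backup",
  "config": {
    "indices": ["*", "-.monitoring-*"],
    "include_global_state": true
  },
  "retention": {
    "expire_after": "37d"
  }
}
```

The cron expression uses Elasticsearch's seconds-first format, so `0 0 * * * ?` fires at minute 0 of every hour, and `expire_after: 37d` matches the 30 + 7 day window above.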

If it's unclear what I am asking for, leave a comment and I will try to explain in more detail.

Additional question:
In relational databases like Oracle we have the possibility to write archive and redo logs. So in the case of a restore, you can recover without any data loss up to the last transaction before the database crashed. Is there an equivalent in Elasticsearch, or is the only option to take a small snapshot every few minutes to keep the window of potential data loss as small as possible? What are the best practices here when a customer says data loss is unacceptable?

Thanks a lot, Andreas

It's normally a good idea to snapshot everything, but it's also recommended to monitor your cluster using a separate monitoring cluster. From the docs:

In production, you should send data to a separate monitoring cluster so that historical monitoring data is available even if the nodes you are monitoring are not.

Taking a snapshot every 30 minutes is not unusual.

They're mostly blindly copying files that already exist on disk, rather than doing anything computationally expensive. Additionally, the max_snapshot_bytes_per_sec setting limits how fast the snapshot runs. This defaults to 40MB/s, but you can reduce this if you see an excessive performance impact.
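For reference, that throttle is a per-repository setting applied when the repository is registered. A sketch, with a made-up filesystem repository name and path:

```
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup",
    "max_snapshot_bytes_per_sec": "20mb"
  }
}
```

Here the default of 40MB/s is halved; you would tune this value based on the observed impact on your cluster.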

A snapshot is made up of multiple objects and each object may belong to more than one snapshot. In a filesystem repository "object" roughly means "file". When taking a new snapshot, it will re-use as many objects as possible from those that already exist in the repository rather than making new copies of this data. Each object is deleted once all of the snapshots to which it belongs are deleted.

Each object belongs to a single index, so if you delete all the snapshots that refer to a particular index then all of the objects for that index will be deleted.
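In practice this means deleting an old snapshot is a single, safe operation; for example (repository and snapshot names made up):

```
DELETE _snapshot/my_backup/hourly-snap-2019.05.01
```

Only the files that no remaining snapshot references are actually removed from the repository; everything still shared with newer snapshots stays in place.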

Yes, the equivalent is for each shard to have replicas, which they do by default.

Hey @DavidTurner,
thanks for your fast reply.

But replicas only help against a certain number of failing nodes, right? So if the nodes holding the primary shards fail, the data is still available in the replicas.
But in the case of a software issue, or a user who deletes everything, I can only go back to my latest valid snapshot, whereas with redo logs I could theoretically also recover the time between the last snapshot/backup and the crash/accident. Correct?

I just need to know this so I can communicate correctly with the project team and the customer, and so I know the limits :wink:

Thanks, Andreas

I see. Yes, snapshots are the only real remedy for this.
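If you do need to roll back to the latest valid snapshot, a restore could look like this sketch (repository and snapshot names assumed, matching the earlier examples):

```
POST _snapshot/my_backup/hourly-snap-2019.05.01/_restore
{
  "indices": "*",
  "include_global_state": true
}
```

Note that indices that already exist in the cluster under the same names must be closed or deleted before they can be restored over.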
