Advice on snapshot and restore setup for an Elasticsearch DR scenario

We currently have an Elasticsearch cluster deployed on on-prem Kubernetes with the Elastic operator (ECK).

Our main goals are exactly as described in the documentation:

  • Regularly back up a cluster with no downtime
  • Recover data after deletion or a hardware failure

I have a few questions I am struggling to find answers to, and I wonder if the community can help me out:

  1. What control do I have over what is included in a snapshot, and how granular does it get? For example, I have a one-week index retention/lifecycle in my cluster - can I take daily snapshots that only include the shards for the last 24 hours, or does every snapshot include all indices and all shards?
  2. Similarly to 1), what control do I have over what can be restored? For example, if I have lost/deleted a single shard or index, can I restore just that shard/index, or do I have to perform a whole-cluster restore?
  3. After a hardware failure where all Elasticsearch nodes and data have been lost, what does the recovery process look like? Redeploy the cluster to Kubernetes again and restore the latest snapshot on top of it?
  4. If 1) is achievable and we take snapshots of a single day's shards, how do I restore a whole cluster after a hardware failure? Can I only use the latest snapshot, or can multiple snapshots be restored (ex. the last 5 days' worth of snaps)?

Any other ideas around preparing for DR or better approaches are welcome as well - thanks!

You can specify which indices should be included, but generally that's a bad idea. Snapshots are incremental and deduplicated, so it's usually best to include everything: segments that haven't changed since the last snapshot won't cost you any extra storage.
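As a sketch, a daily SLM (snapshot lifecycle management) policy that snapshots everything and lets retention prune old snapshots could look like this - the policy name, repository name, and retention values here are placeholders, not recommendations:

```
PUT _slm/policy/daily-snapshots
{
  "schedule": "0 30 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "my_repository",
  "config": {
    "indices": "*",
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```

The `config.indices` field is where you could narrow the scope if you really wanted to, but per the above, `"*"` plus deduplication is usually the better trade.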

Likewise, you can specify which indices to restore. Restoring one shard out of a multi-shard index doesn't really make sense and is not supported; the index is the smallest unit you can restore.
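For instance, restoring one lost index from a full snapshot might look like this (the snapshot and index names are made up for illustration; note that if the index still exists in the cluster you'd have to delete it first, or restore it under a new name using `rename_pattern`/`rename_replacement`):

```
POST _snapshot/my_repository/daily-snap-2024.01.15/_restore
{
  "indices": "logs-2024.01.14",
  "include_global_state": false
}
```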

Sounds about right: redeploy the cluster, register the same snapshot repository, and restore the latest snapshot into the fresh cluster.
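A rough sketch of that recovery sequence, assuming an S3-type repository (the repository type, bucket, and snapshot names are placeholders - with ECK, repository credentials would typically be injected into the keystore via the cluster's secure settings):

```
PUT _snapshot/my_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket"
  }
}

GET _snapshot/my_repository/*?verbose=false

POST _snapshot/my_repository/daily-snap-2024.01.15/_restore
{
  "indices": "*",
  "include_global_state": true
}
```

The `GET` in the middle lists the snapshots the repository already holds, so you can pick the latest one to restore.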

You can restore multiple snapshots, but that's just making life hard for yourself. Take full snapshots and rely on deduplication for the storage savings instead.