We currently have an Elasticsearch cluster deployed on on-prem Kubernetes using the Elastic operator.
Our main goals are exactly as described in the documentation:
- Regularly back up a cluster with no downtime
- Recover data after deletion or a hardware failure
I have a few questions I am struggling to find the answers to and wonder if the community can help me out:
1. What control do I have over what is included in a snapshot, and how granular does it get? For example, I have a 1-week index retention/lifecycle in my cluster. Can I take daily snapshots that include only the shards for yesterday (the last 24 hours), or does every snapshot include all indices and all shards?
2. Similarly to 1), what control do I have over what can be restored? For example, if I have lost or deleted a single shard/index, can I restore only that shard/index, or do I have to perform a whole-cluster restore?
3. After a hardware failure where all Elasticsearch nodes and data have been lost, what does the recovery process look like? Do I redeploy the cluster to Kubernetes and then restore the latest snapshot on top of it?
4. If 1) is achievable and we do take snapshots of a single day's shards, how do I restore a whole cluster after a hardware failure? Can I only use the latest snapshot, or can multiple snapshots be restored (e.g. the last 5 days' worth of snapshots)?
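
To make the first two questions more concrete, here is a rough sketch of the requests I imagine sending. It only builds the method, path, and JSON body for the snapshot and restore endpoints and sends nothing. The repository name `my_backup_repo`, the `logs-YYYY.MM.DD` index naming, and the `daily-` snapshot prefix are all made up for illustration, and I am assuming that selection happens per index through the `indices` field. I have not confirmed this is the right approach.

```python
from datetime import date, timedelta

REPO = "my_backup_repo"   # hypothetical snapshot repository name
INDEX_PREFIX = "logs-"    # hypothetical daily index naming: logs-YYYY.MM.DD


def daily_snapshot_request(today: date):
    """Build a create-snapshot request limited to yesterday's index."""
    yesterday = today - timedelta(days=1)
    stamp = f"{yesterday:%Y.%m.%d}"
    path = f"/_snapshot/{REPO}/daily-{stamp}"
    body = {
        "indices": f"{INDEX_PREFIX}{stamp}",  # only this index, not the whole cluster
        "include_global_state": False,        # leave cluster-wide state out
    }
    return "PUT", path, body


def restore_request(snapshot: str, index_name: str):
    """Build a restore request for a single index from a given snapshot."""
    path = f"/_snapshot/{REPO}/{snapshot}/_restore"
    body = {"indices": index_name, "include_global_state": False}
    return "POST", path, body


if __name__ == "__main__":
    print(daily_snapshot_request(date.today()))
    print(restore_request("daily-2024.01.14", "logs-2024.01.14"))
```

If something like this works, question 4 becomes whether I would have to run the restore once per daily snapshot to rebuild the whole week.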
Any other ideas around preparing for DR, or better approaches, are welcome as well. Thanks!