We have been verifying our snapshots on a daily basis to make sure they are in a good state, i.e. that they contain no corrupted data. We do this by spinning up a new cluster and restoring a snapshot to it, which confirms that we have at least one good snapshot.
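For context, the restore step on the validation cluster looks roughly like the sketch below. It is only an illustration, not our exact tooling: the base URL, repository name, and snapshot name are placeholders, and the success check assumes the restore response reports a `shards.failed` count when `wait_for_completion=true` is set.

```python
import json
import urllib.request


def build_restore_request(base_url, repository, snapshot, indices="*"):
    """Build (but do not send) a restore request against the validation cluster.

    base_url, repository and snapshot are placeholder values supplied by the caller.
    """
    url = (
        f"{base_url}/_snapshot/{repository}/{snapshot}/_restore"
        "?wait_for_completion=true"
    )
    body = {
        "indices": indices,             # restore everything by default
        "include_global_state": False,  # only index data matters for validation
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def restore_succeeded(response):
    """Return True if the restore response reports no failed shards.

    Assumes the response has the shape {"snapshot": {"shards": {"failed": N, ...}}}.
    """
    shards = response.get("snapshot", {}).get("shards", {})
    return bool(shards) and shards.get("failed", 1) == 0


def run_restore(request, timeout=None):
    """Send the request and decode the JSON response (needs a reachable cluster)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

In practice a restore that runs for hours would more likely be started without `wait_for_completion` and polled separately; the blocking form is used here only to keep the sketch short.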
However, as the data grows this becomes increasingly complex. We have grown to 100+ nodes and 120+ TB of data, restoring one snapshot can take up to 15 hours, and we need to spin up another production-size cluster with 100 nodes just to validate a snapshot. We could move the validation to weekly or bi-weekly, but I am wondering whether there are better approaches to validation than spinning up a prod-size cluster.
I was looking forward to this, but the issue is now closed: Validate snapshot via dry-run restore. Does anything similar exist?
Based on what I've read, Elastic has a checksum "check" at each snapshot, but we recently still had a corrupted index, and hence we would like to do this additional validation.