Over time, a number of snapshots have failed for various reasons that were outside the control of ES. I am now left with what I believe is a large number of orphaned Lucene files inside the shard directories of each of my indexes.
I've been looking over how snapshots are stored, and I believe I understand it, but I'd like someone to confirm before I move forward.
In a shard directory, I have files like __0 and __1, as well as the snapshot JSON files, each of which lists the files referenced by that particular snapshot. If I were to combine all the snapshot file lists, I should have a complete list of every file in that shard's directory that is needed to restore any of the snapshots.
The issue is that some of my shards have many files that are not referenced by any of these snapshot files. My assumption is that these are left over from previous failed backups, and that removing them will not impede my ability to restore the snapshots that are good.
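
To make the check concrete, here is a rough Python sketch of the comparison described above: take the union of the file lists from every snapshot JSON file in a shard directory, then report the __ files on disk that none of them reference. Several details are assumptions rather than confirmed facts about the repository format: that the snapshot files match the pattern snapshot-*, that each has a top-level "files" list whose entries carry a "name" field such as __0, and that large files may be split into pieces named like __0.part0. The function names are made up for illustration, and the script only lists candidates; it does not delete anything.

```python
import json
import re
from pathlib import Path

# Assumption: large blobs may be stored in pieces named "<blob>.part<N>".
PART_SUFFIX = re.compile(r"\.part\d+$")


def referenced_blobs(shard_dir, pattern="snapshot-*"):
    """Union of blob names listed by every snapshot JSON file in shard_dir."""
    names = set()
    for meta in Path(shard_dir).glob(pattern):
        doc = json.loads(meta.read_text())
        # Assumption: a top-level "files" list of {"name": "__0", ...} entries.
        for entry in doc.get("files", []):
            names.add(entry["name"])
    return names


def unreferenced_blobs(shard_dir, pattern="snapshot-*"):
    """Names of "__" files on disk that no snapshot file list mentions."""
    referenced = referenced_blobs(shard_dir, pattern)
    orphans = []
    for path in Path(shard_dir).iterdir():
        if not path.is_file() or not path.name.startswith("__"):
            continue  # skip snapshot metadata and anything that is not a blob
        base = PART_SUFFIX.sub("", path.name)  # "__3.part0" -> "__3"
        if base not in referenced:
            orphans.append(path.name)
    return sorted(orphans)
```

Calling unreferenced_blobs() on a copy of one shard directory and comparing the result against a test restore would be a cheap way to validate the theory before touching the real repository.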
I am going to do some small-scale testing of my theory, but testing at full scale is not feasible since my total data set is 60 TB+.
Does anyone out there know about this at a more advanced level than I do? I've not had much luck finding information on this, but I have not yet dived into the source code.