S3 snapshot repository cooldown period unnecessary?

I'm currently looking at upgrading some 6.8 clusters to 7.17, and noticed the artificial 3-minute wait on create/delete snapshot operations against an S3 repository, which appears to have been added sometime in early 2020. I don't understand the finer details of the potential metadata inconsistency, but I have a question for the Elastic folks: now that S3 is strongly consistent (as of Dec 2020), is this cooldown period still necessary at all for repos with pre-7.6 snapshots, and if so, why?

My plan is to set the cooldown period to 0 and test with that, so I'd like to understand whether there is any potential risk in doing so, particularly as we have been doing just fine with cooldown-free snapshots on 6.8 for several years.
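For reference, this is roughly what I have in mind; the repository name and bucket are placeholders, and cooldown_period is my reading of the S3 repository setting that arrived alongside the 7.6 format change:

```
PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "cooldown_period": "0s"
  }
}
```

(Re-registering an existing repository with the same PUT updates its settings in place.)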

Update: this is somewhat of a separate issue, but after setting the cooldown to 0, I find that snapshot deletion still takes rather a long time (~2 minutes) to delete a single snapshot containing a single index. That is far longer than it used to take on 6.8 (5 or 6 seconds). It appears to spend most of that time listing all of the root blobs in the S3 repository bucket, essentially for the purpose of finding stale blobs. The thing is, in my current upgrade test it has never found any stale blobs to delete, as far as I'm aware, so I'm wondering why I'm paying the price of listing the root blobs every single time I want to delete a snapshot. What's the harm in having stale blobs lying around (if that's even likely to happen), or in only cleaning them up at some arbitrary periodicity instead of on every single snapshot delete?
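In case anyone wants to see the same breakdown, one way to surface the individual S3 requests is to turn up the repository logging, something like the following; the logger names here are my assumption based on the 7.x repository-s3 plugin package and the AWS SDK's request logger:

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.repositories.s3": "TRACE",
    "logger.com.amazonaws.request": "DEBUG"
  }
}
```

The timestamps on the resulting list/delete log lines should make it clearer where the ~2 minutes is going.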

Giving a definite answer to this would unfortunately take more analysis than is likely to happen, given that all the involved versions are past EOL these days. Even if it is no longer necessary with the real S3, we wouldn't just remove this delay from Elasticsearch entirely, since there are other supposedly-S3-compatible systems that still don't offer such strong consistency guarantees.

Yes, it's not an easy bug to reproduce; you have to be quite unlucky to hit it. But the consequences are pretty severe, hence the abundance of caution.

I think this also only happens in cases involving versions past EOL, which unfortunately don't get much optimisation effort.

The simplest way to speed things up here is to move to a new repository after upgrading.
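For instance, you could register a fresh repository under a different base_path and point new snapshots at it; the names here are illustrative:

```
PUT _snapshot/my_s3_repo_v7
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "base_path": "snapshots-v7"
  }
}
```

The old repository can then be kept around (read-only if you like) purely for restoring the pre-7.6 snapshots until they age out.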

Ok, fair enough... it's simple enough to disable the cooldown until all of our 6.x snapshots are replaced.

My interpretation of the code in BlobStoreRepository.deleteSnapshots is that it will always list all root blobs, regardless of repositoryMetaVersion. The unlinked root blob cleanup (which seems to be the only thing that actually needs the full root blob listing) also appears to be called for both 'new' and 'old' repos.

Based on that interpretation, my inclination is to disable that cleanup on snapshot deletion and instead call the repository cleanup API periodically as a housekeeping task... at least I think that would suit my use case. Would you folks be open to a PR that makes the cleanup-stale-blobs-on-snapshot-delete behaviour controlled by a new setting (enabled by default to preserve the current behaviour)?
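For the periodic housekeeping I'd simply schedule the existing cleanup endpoint against the repository, e.g. (repository name is a placeholder):

```
POST _snapshot/my_s3_repo/_cleanup
```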

I guess I'd also still like to understand if there is any risk posed by the existence of stale root blobs (apart from improbable UUID clashes?).

@Armin_Braun sorry to bother you, but would you be able to comment on the above proposal for some way of avoiding a full root blob listing on every snapshot delete? I'm happy to just submit a PR on GitHub if you think it's reasonable, or if you have any alternative ideas.

@bpiper I'm not sure I understand why the stale blob listing is an issue for you in the first place.
The reason we eventually made the listing and cleanup run on every delete was that our testing showed it to be very cheap/fast. Even when benchmarking repositories with thousands of snapshots I have never found the listing operation to be slow.

I am aware that the deletes themselves can take quite a long time when cleaning up stale indices, but listing and maybe cleaning up from the repo root should be trivial in overhead. The listing returns a thousand blobs per API call, so even if you have thousands of snapshots it should rarely take more than a second.
How many snapshots do you have in your repository? Do you by any chance have unrelated blobs, which have nothing to do with the ES repository, under the repository path? How did you measure that listing blobs is the slow part on your end?
