I'm currently looking at upgrading some 6.8 clusters to 7.17, and noticed the artificial 3-minute wait on create/delete snapshot operations on an S3 repository that appears to have been added sometime in early 2020. I don't understand the finer details of the potential metadata inconsistency, but I have a question for the Elastic folks: now that S3 is strongly consistent (as of Dec 2020), is this cooldown period still necessary at all for repos with pre-7.6 snapshots, and if so, why?
My plan is to set the cooldown period to 0 and test with that, so I'd like to understand if there is any potential risk in doing so, particularly as we have been doing just fine with cooldown-free snapshots on 6.8 for several years.
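For reference, this is roughly how I plan to re-register the repository for testing. It's only a sketch: the host, repository and bucket names are placeholders, and I'm assuming the repository-s3 `cooldown_period` setting accepts a value of 0.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class DisableCooldown {
    public static void main(String[] args) throws Exception {
        // Placeholder host/repo/bucket names; cooldown_period: 0 is the part I want to test.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/_snapshot/my_s3_repo");
            request.setJsonEntity("{"
                + "\"type\": \"s3\","
                + "\"settings\": {"
                + "\"bucket\": \"my-snapshot-bucket\","
                + "\"cooldown_period\": \"0\""
                + "}"
                + "}");
            client.performRequest(request);
        }
    }
}
```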
Update: kind of a separate issue, but after setting the cooldown to 0, I find that snapshot deletion is still taking rather a long time (~2 minutes) to delete just a single snapshot containing a single index. This is way longer than it used to take on 6.8 (5 or 6 seconds). It appears to spend most of that time listing all of the root blobs in the S3 repository bucket, essentially in order to find stale blobs. Thing is, in my current upgrade test, as far as I'm aware it hasn't ever found any stale blobs to delete, so I'm wondering why I'm paying the price of listing the root blobs every single time I want to delete a snapshot. What's the harm in having stale blobs lying around (if that's even likely to happen), or in only cleaning them up at some arbitrary periodicity instead of on every single snapshot delete?
Giving a definite answer to this would unfortunately take more analysis than is likely to happen given that all the involved versions are past EOL these days. Even if it were not necessary any more with the real S3, we wouldn't just completely remove this delay from Elasticsearch since there are other supposedly-S3-compatible systems which still don't have such strong consistency guarantees.
Yes, it's not an easy bug to reproduce; you have to be quite unlucky to hit it. But the consequences are pretty severe, hence the abundance of caution.
I think this also only happens in cases involving versions past EOL, which unfortunately don't get much optimisation effort.
The simplest way to speed things up here is to move to a new repository after upgrading.
Ok, fair enough... it's simple enough to disable the cooldown until all of our 6.x snapshots are replaced.
My interpretation of the code in BlobStoreRepository.deleteSnapshots is that it will always list all root blobs, regardless of repositoryMetaVersion. The unlinked root blob cleanup (which seems to be the only thing that actually needs the full root blob listing) also appears to be called for both 'new' and 'old' repos.
Based on that interpretation, my inclination is to disable that cleanup on snapshot deletion and maybe just call the repository cleanup API periodically as a housekeeping task... at least I think that would suit my use case, and I wonder if you folks would be open to a PR that makes the cleanup-stale-blobs-on-snapshot-delete behaviour determined by a new setting (enabled by default to preserve current behaviour)?
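To make that concrete, here's the sort of thing I have in mind. The `cleanup_on_delete` name is purely hypothetical, and this only sketches how it might be declared alongside the existing repository settings, not the full wiring:

```java
import org.elasticsearch.common.settings.Setting;

final class SnapshotDeleteCleanupSetting {
    // Hypothetical repository setting: when false, skip the stale-root-blob
    // listing/cleanup during snapshot deletion and leave that housekeeping to
    // the repository cleanup API. Defaults to true to preserve current behaviour.
    static final Setting<Boolean> CLEANUP_ON_DELETE_SETTING =
        Setting.boolSetting("cleanup_on_delete", true);
}
```

The idea would be that BlobStoreRepository.deleteSnapshots only performs the root blob listing and stale-blob cleanup when that setting evaluates to true for the repository.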
I guess I'd also still like to understand if there is any risk posed by the existence of stale root blobs (apart from improbable UUID clashes?).
@Armin_Braun sorry to bother, but would you be able to comment on the above proposal for some way of avoiding a full root blob listing on every snapshot delete? I'm happy to just submit a PR on GitHub if you think it's reasonable, or to hear any alternative ideas.
@bpiper I'm not sure I understand why the stale blob listing is an issue for you in the first place.
The reason we eventually made the listing and cleanup run on every delete was that our testing showed it to be very cheap/fast. Even when benchmarking repositories with thousands of snapshots I have never found the listing operation to be slow.
I am aware that the deletes themselves can take quite a long time when cleaning up stale indices, but listing and maybe cleaning up from the repo root should add only trivial overhead. The listing returns a thousand blobs per API call, so even if you have thousands of snapshots it should rarely take more than a second.
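To give a sense of why that is: the root listing is just ordinary paginated S3 list calls, each returning at most 1,000 keys, along these lines (an illustrative sketch using the AWS SDK, not the exact code the repository uses):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;

// Illustrative only: each list call returns at most 1,000 keys, so a repository
// root containing N blobs costs roughly N / 1000 sequential list requests.
static int countRootBlobs(AmazonS3 s3, String bucket, String repoBasePath) {
    int count = 0;
    ListObjectsV2Request request = new ListObjectsV2Request()
        .withBucketName(bucket)
        .withPrefix(repoBasePath)
        .withDelimiter("/"); // root-level blobs only, not the index-* subdirectories
    ListObjectsV2Result result;
    do {
        result = s3.listObjectsV2(request);
        count += result.getObjectSummaries().size();
        request.setContinuationToken(result.getNextContinuationToken());
    } while (result.isTruncated());
    return count;
}
```

With a few thousand snapshots that is only a handful of sequential requests, which is why it has always looked cheap in our benchmarks.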
How many snapshots do you have in your repository? Do you by any chance have any unrelated blobs that have nothing to do with the ES repository in the repository path maybe? How did you measure that listing blobs is the slow thing on your end?
@Armin_Braun we have hundreds of thousands of snapshots, and no unrelated blobs in the relevant S3 bucket path.
We have logging that effectively records the time taken for snapshot deletes, and I could also see from thread dumps that Elasticsearch was spending most of that time listing root blobs. In the environment I was testing with, there are roughly 330K blobs, and listing them with the AWS CLI takes about 120-180 seconds (which makes sense at 1,000 keys per list call: roughly 330 sequential requests).
So yes, this is definitely a couple of orders of magnitude greater than what you're probably used to, and what most people would probably have in their snapshot repos. The way we use snapshots is kind of analogous to S3 storage tiering, e.g. we have live indices that customers can query quickly, and then indices that haven't been queried in a while that get automatically archived (so they're not a burden on the cluster) and restored on-demand if customers want to look at their old data.
I won't go into detail on why we have so many indices in total, but suffice it to say there are reasons why they aren't combined into fewer indices (for a start, it's not time-series data). So basically we try to keep the number of live indices at a relatively low/stable level, but equally it's important that the time cost of snapshot operations is either sub-linear in the total number of snapshots, or at least low enough not to be a concern. This was the case in 6.8.x, but as mentioned above the behaviour seems to have changed in 7.
I can appreciate that the snapshot system probably wasn't written with the expectation of more than a few thousand snapshots in a given repo, although even for a repo with just 10,000 snapshots the root blob listing might still take around 4 seconds (interpolating from my experience), which could be a significant cost for some use cases. If the only reason to list the root blobs on snapshot deletion is to clean up orphaned blobs that are otherwise harmless (?), then it would be convenient to have the option of disabling that and putting the responsibility for repository cleanup (via the cleanup API) on whoever disables it.
I'd be happy to submit a PR for the above... or I'm open to other possible solutions.