Snapshot not registering on remote cluster

Hey,
We are trying to restore a snapshot on a remote cluster of ours.
We created an S3 snapshot repo in the source cluster, took the snapshot, and then restored that snapshot on a remote ES cluster.
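Roughly, the flow we follow looks like this (the repository, bucket and snapshot names here are placeholders, not our real ones):

# On the source cluster: register the S3 repository and take the snapshot
PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket"
  }
}

PUT _snapshot/my_s3_repo/my_snapshot?wait_for_completion=false

# On the remote cluster: register the same bucket (read-only) and restore
PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "readonly": true
  }
}

POST _snapshot/my_s3_repo/my_snapshot/_restore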

We are seeing a problem where, from time to time, our snapshot gets "stuck" and does not register on the remote cluster. When we check the _status of the snapshot in the source cluster, its state is "SUCCESS" and all of the shards are in stage "DONE".
When we GET the snapshot itself (without using _status), its state is "IN_PROGRESS", even though all of the shards are successful.
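Concretely, with the same placeholder names, these are the two calls we compare:

# Per-shard status on the source cluster: state SUCCESS, all shards DONE
GET _snapshot/my_s3_repo/my_snapshot/_status

# Plain GET of the same snapshot: state still IN_PROGRESS
GET _snapshot/my_s3_repo/my_snapshot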

Why is the snapshot stuck in "IN_PROGRESS" in the source cluster even though it's done?

We use ES 7.17.6

This means it's copied all the shards' data, but there's one last finalization step remaining in order to write the corresponding metadata in the repository. Finalizations happen one-at-a-time and might have to wait for other activities in other snapshots too.

There aren't any other snapshots running; is there a way for me to get a clearer indication of what that step is doing?

What does GET _cluster/pending_tasks return when the system seems stuck?

There aren't any pending tasks; the list is empty.

Hmm AFAIK we should never have a snapshot in state IN_PROGRESS with shards having completed their work unless the finalization is in the master's pending task queue. So I suspect a bug then. Strange that it's never caused any test failures tho, we have quite a lot of tests to catch this sort of thing.

Is your cluster at all unstable? Nodes leaving and rejoining, master elections etc? Can you reproduce this problem if you set snapshot.max_concurrent_operations: 1?

If you trigger a master failover (e.g. restart the elected master node) then does the snapshot complete?
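If it helps, you can see which node is currently the elected master with:

GET _cat/master?v=true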

So it's not hard for me to reproduce this even without changing that parameter; it pretty much happens all the time.

I'd prefer not to change any cluster parameter that requires a cluster restart, and I'd also rather not restart a master node.

Looking at the logs (INFO and above) of the master node, I can see only two log lines referencing the snapshot: one saying it started, and one an hour later saying it completed (although the snapshot was effectively done almost immediately).

Before I resort to these harsher actions, is there anything else I can check to be better informed?

You can set snapshot.max_concurrent_operations via PUT _cluster/settings.
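It's a dynamic cluster setting, so something along these lines should work without any restarts:

PUT _cluster/settings
{
  "persistent": {
    "snapshot.max_concurrent_operations": 1
  }
}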

Ah ok so it's not really stuck, it's just taking longer than you'd like to do the finalization work. I could believe that finalization takes an hour or so in 7.17 -- I haven't looked at this code in a while but it's definitely doing things differently from 8.x.

Hey,
I tried to provoke a failover of the master node and it did help finalize the snapshot almost immediately after the new master was elected.
But I keep running into the same problem with the new master node (even when the master node is a brand new instance).
I cannot just trigger failovers whenever I perform a snapshot.

Digging further, I realised I was mistaken when I said there were no other snapshot operations queued: there is one, but it didn't show in the _snapshot API because it's a delete operation. This operation takes a long while, and I can see a large number of snapshot threads queued on the master node.
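For reference, this is how I'm watching the queue build up on the master node (assuming I'm reading the _cat/thread_pool columns correctly):

GET _cat/thread_pool/snapshot?v=true&h=node_name,name,active,queue,rejected,completed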

Other than increasing the thread pool, is there a way for me to prioritize my snapshot in the cluster? The delete operation takes a long while because this cluster has a great many shards being snapshotted (there are currently 5000 pending tasks in the snapshot queue).

No, there's no way to affect the priority of these things.

When you say you're using a S3 repository, do you mean AWS S3 or is it something from a third-party which claims to be S3-compatible?

AWS S3.

I'm attempting to beef up my master nodes so they can chew through that queue much faster.

In the end we increased the snapshot thread pool size substantially and that solved the problem quickly.
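In case anyone else hits this: what we changed was the snapshot thread pool size in elasticsearch.yml on the master-eligible nodes, roughly like this (if I recall the setting name correctly; the value is just an example, and this does require a node restart):

# elasticsearch.yml
thread_pool.snapshot.max: 10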