Hey,
We are trying to restore a snapshot on a remote cluster of ours.
We have created an S3 snapshot repository in the source cluster; we take a snapshot and then restore that snapshot on a remote Elasticsearch cluster.
We are seeing a problem where, from time to time, our snapshot gets "stuck" and never registers on the remote cluster. When we check the _status of the snapshot on the source cluster, its state is "SUCCESS" and all of the shards are in stage "DONE".
When we GET the snapshot itself (without using _status), its state is "IN_PROGRESS", even though all of the shards are successful.
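For reference, these are the two calls we're comparing (the repo and snapshot names here are just placeholders, not our real ones):

```
# reports state SUCCESS and every shard at stage DONE
GET _snapshot/my_s3_repo/my_snapshot/_status

# reports state IN_PROGRESS, yet all shards successful
GET _snapshot/my_s3_repo/my_snapshot
```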
Why is the snapshot stuck in "IN_PROGRESS" on the source cluster even though it is done?
This means it's copied all the shards' data, but there's one last finalization step remaining in order to write the corresponding metadata in the repository. Finalizations happen one-at-a-time and might have to wait for other activities in other snapshots too.
Hmm, AFAIK we should never have a snapshot in state IN_PROGRESS whose shards have all completed their work unless the finalization is in the master's pending task queue. So I suspect a bug then. Strange that it's never caused any test failures though; we have quite a lot of tests to catch this sort of thing.
Is your cluster at all unstable? Nodes leaving and rejoining, master elections etc? Can you reproduce this problem if you set snapshot.max_concurrent_operations: 1?
If you trigger a master failover (e.g. restart the elected master node) then does the snapshot complete?
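In case it's useful, the cat master API will show you which node is currently the elected master, so you know which one to restart:

```
GET _cat/master?v
```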
So it's not hard for me to reproduce this even without changing that setting; it happens pretty much all the time.
I'd prefer not to change any cluster setting that requires a cluster restart, nor do I want to restart a master node.
Looking at the master node's logs (INFO and above), I can see only two entries referencing the snapshot: one when it started and one, an hour later, when it completed (although the snapshot was effectively done almost immediately).
Before I resort to these harsher actions, is there anything else I can check to be better informed?
You can set snapshot.max_concurrent_operations via PUT _cluster/settings.
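For example, something like this applies it dynamically, without a restart (you can set it back to null later to clear it):

```
PUT _cluster/settings
{
  "persistent": {
    "snapshot.max_concurrent_operations": 1
  }
}
```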
Ah ok so it's not really stuck, it's just taking longer than you'd like to do the finalization work. I could believe that finalization takes an hour or so in 7.17 -- I haven't looked at this code in a while but it's definitely doing things differently from 8.x.
Hey,
I tried provoking a failover of the master node, and it did help: the snapshot was finalized almost immediately after the new master was elected.
But now I keep hitting the same problem on the new master node (even when the master node is a brand-new instance).
I cannot just trigger failovers whenever I perform a snapshot.
Digging further, I realized I was mistaken: there is in fact another snapshot operation queued. It didn't show up in the _snapshot API because it's a delete operation. This delete takes a long while, and I can see a large number of queued tasks in the snapshot thread pool on the master node.
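For anyone else trying to spot these: I believe in-progress deletions should also show up in the raw cluster state under a snapshot_deletions section, though I'm not sure the exact key is the same in every version:

```
# filter the full cluster state down to in-progress snapshot deletions
GET _cluster/state?filter_path=snapshot_deletions
```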
Other than increasing the thread pool size, is there a way for me to prioritize my snapshot in the cluster? The delete operation takes a long while because this cluster has many, many shards being snapshotted (there are currently 5000 pending tasks in the snapshot thread pool queue).
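In case it matters, this is how I'm measuring that backlog (just the cat thread pool and pending tasks APIs; the column list is only what I happened to pick):

```
# per-node queue of the snapshot thread pool
GET _cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected

# cluster-state update tasks waiting on the elected master
GET _cluster/pending_tasks
```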