Hey,
We are trying to restore a snapshot on a remote cluster of ours.
We have created an S3 snapshot repository in the source cluster; we take a snapshot and then restore that snapshot on a remote Elasticsearch cluster.
We are seeing a problem where, from time to time, our snapshot gets "stuck" and never registers on the remote cluster. When we check the _status of the snapshot on the source cluster, its state is "SUCCESS" and all of the shards are in stage "DONE".
When we GET the snapshot itself (without using _status), its state is "IN_PROGRESS", even though all of the shards are successful.
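For reference, these are the two calls we're comparing (the repo and snapshot names here are just placeholders, not our real ones):

```
# reports state SUCCESS and every shard at stage DONE
GET _snapshot/my_s3_repo/my_snapshot/_status

# reports state IN_PROGRESS, yet all shards successful
GET _snapshot/my_s3_repo/my_snapshot
```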
Why is the snapshot stuck in "IN_PROGRESS" on the source cluster even though it is done?
This means it's copied all the shards' data, but there's one last finalization step remaining in order to write the corresponding metadata in the repository. Finalizations happen one-at-a-time and might have to wait for other activities in other snapshots too.
Hmm, AFAIK we should never have a snapshot in state IN_PROGRESS whose shards have all completed their work unless the finalization is in the master's pending task queue. So I suspect a bug then. Strange that it's never caused any test failures though; we have quite a lot of tests to catch this sort of thing.
Is your cluster at all unstable? Nodes leaving and rejoining, master elections etc? Can you reproduce this problem if you set snapshot.max_concurrent_operations: 1?
If you trigger a master failover (e.g. restart the elected master node) then does the snapshot complete?
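In case it's useful, the cat master API will show you which node is currently the elected master, so you know which one to restart:

```
GET _cat/master?v
```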
So it's not hard for me to reproduce this even without changing that setting; it happens pretty much all the time.
I'd prefer not to change any cluster setting that requires a cluster restart, nor do I want to restart a master node.
Looking at the master node's logs (INFO and above), I can see only two entries referencing the snapshot: one when it started and one, an hour later, when it completed (although the snapshot was effectively done almost immediately).
Before I resort to these harsher actions, is there anything else I can check to be better informed?
You can set snapshot.max_concurrent_operations via PUT _cluster/settings.
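For example, something like this applies it dynamically, without a restart (you can set it back to null later to clear it):

```
PUT _cluster/settings
{
  "persistent": {
    "snapshot.max_concurrent_operations": 1
  }
}
```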
Ah ok so it's not really stuck, it's just taking longer than you'd like to do the finalization work. I could believe that finalization takes an hour or so in 7.17 -- I haven't looked at this code in a while but it's definitely doing things differently from 8.x.
Hey,
I tried provoking a failover of the master node, and it did help: the snapshot was finalized almost immediately after the new master was elected.
But now I keep hitting the same problem on the new master node (even when the master node is a brand-new instance).
I cannot just trigger failovers whenever I perform a snapshot.
Digging further, I realized I was mistaken: there is in fact another snapshot operation queued. It didn't show up in the _snapshot API because it's a delete operation. This delete takes a long while, and I can see a large number of queued tasks in the snapshot thread pool on the master node.
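For anyone else trying to spot these: I believe in-progress deletions should also show up in the raw cluster state under a snapshot_deletions section, though I'm not sure the exact key is the same in every version:

```
# filter the full cluster state down to in-progress snapshot deletions
GET _cluster/state?filter_path=snapshot_deletions
```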
Other than increasing the thread pool size, is there a way for me to prioritize my snapshot in the cluster? The delete operation takes a long while because this cluster has many, many shards being snapshotted (there are currently 5000 pending tasks in the snapshot thread pool queue).
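In case it matters, this is how I'm measuring that backlog (just the cat thread pool and pending tasks APIs; the column list is only what I happened to pick):

```
# per-node queue of the snapshot thread pool
GET _cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected

# cluster-state update tasks waiting on the elected master
GET _cluster/pending_tasks
```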