Restore from snapshot fails with no recovery information

I have two similarly configured three-node Elasticsearch clusters. Both run 7.10.2, but one holds much more data than the other.

I'm trying to restore an index from a snapshot. On the small cluster, the restore takes about 3 minutes, during which the cluster is yellow. On the big cluster, starting the restore causes the cluster to go red. I waited a while, but eventually had to delete the restored index while it was still red so the cluster would come back up.

Both clusters have a repository with scheduled backups.
GET /_cat/snapshots/gcs_repository?v
Shows 4 snapshots, all with status = "SUCCESS".
To give an idea of the size difference, the small cluster's snapshots take 1-2 minutes, while the large cluster's take 4-12 minutes.

POST /_snapshot/gcs_repository/_verify
Shows 3 nodes.

GET /_snapshot/gcs_repository/
Shows the repository; the only difference here is that the large cluster has "max_snapshot_bytes_per_sec" : "320mb".

When I start the restore, I issue a command like:

POST /_snapshot/gcs_repository/daily-backup-0/_restore
{
  "indices": "foo-000001",
  "rename_pattern": "foo-000001",
  "rename_replacement": "foo-test0"
}

Both clusters respond with "accepted".
The small cluster's health goes to yellow, because the new index foo-test0 is yellow. The large cluster goes red, as does the new index.
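
If it helps, index-level health can be requested with something like:

GET /_cluster/health?level=indices

which, I'd expect, shows foo-test0 as the index dragging the cluster to red.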

Then I check the status with
GET /foo-test0/_recovery
And the small cluster responds with a bunch of details about the recovery process, including the percent done. Awesome.

But the large cluster responds with
{ }

Since the large cluster is in use, I can't leave it red for long, so I delete the new index and it goes back to green.
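
(The cleanup is just a delete of the restored index, e.g.:

DELETE /foo-test0

after which health recovers on its own.)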

  • Any ideas what is going on?
  • Is there any command to check the status of my snapshots for damage?
  • Is there a way to restore the snapshot without affecting the cluster's health?

Thanks for any help, Bill

So that indicates there are no recoveries going on, presumably because they failed quickly. I would expect there to be helpful details in the logs, but the cluster allocation explain API is always the best way to diagnose non-green health.
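
For example, something along these lines (the index name is the restored one from your post, and shard 0 is just a guess at one of the unassigned shards):

GET /_cluster/allocation/explain
{
  "index": "foo-test0",
  "shard": 0,
  "primary": true
}

Called with no body, it picks an arbitrary unassigned shard and explains why it can't be allocated.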


Thanks very much David. I totally forgot to check the logs!

There were 4 lines of output each time I tried to restore. The first three end with:
node [o3XmuKUVRwi_nL9WIy0LSA] would have less than the required threshold of 0b free (currently 46.4gb free, estimated shard size is 131.4gb), preventing allocation

So it's pretty clear what's wrong: the nodes don't have enough free disk to hold the restored shards.
Thank you!
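
For anyone who lands here later: a quick way to check per-node free disk before attempting a restore is the cat allocation API, e.g.:

GET /_cat/allocation?v

which lists disk.used and disk.avail for each node, so you can compare against the expected shard sizes.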

