Snapshot repository missing after remounting network disk

Hi All,

We decided to increase the amount of storage we have for snapshots on a three-node Elasticsearch v7 cluster.

We did that by copying the contents of the old disk to the bigger disk. When the copy was done, we unmounted the old disk and mounted the new one while the cluster was running. The NFS share got unmounted in the process, so we had to remount it.

We have the shared disk on one of the nodes and share the repository with the other nodes using NFS.

Now, when I request

GET /_cat/snapshots

I get the following error

{
  "error" : {
    "root_cause" : [
      {
        "type" : "action_request_validation_exception",
        "reason" : "Validation Failed: 1: repository is missing;"
      }
    ],
    "type" : "action_request_validation_exception",
    "reason" : "Validation Failed: 1: repository is missing;"
  },
  "status" : 400
}

I can now access the storage on all three servers just fine. However, I have not restarted the servers yet.
Any idea how to solve this?

You need to restart every node. Since you have three nodes, you can do a rolling restart to avoid cluster downtime.
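A rough sketch of the usual sequence, one node at a time, assuming the nodes run as Docker containers reachable on localhost:9200 (the container name is a placeholder):

    # Stop shard reallocation so the cluster does not rebalance while the node is down.
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    { "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

    # Restart the node's container and wait until it shows up again.
    docker restart es-node1
    curl "localhost:9200/_cat/nodes?v"

    # Re-enable allocation and wait for green before moving to the next node.
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    { "persistent": { "cluster.routing.allocation.enable": null } }'
    curl "localhost:9200/_cat/health?v"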

Hi, thanks for your reply!

I restarted the cluster. However, GET /_cat/snapshots still gives the same response:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "action_request_validation_exception",
        "reason" : "Validation Failed: 1: repository is missing;"
      }
    ],
    "type" : "action_request_validation_exception",
    "reason" : "Validation Failed: 1: repository is missing;"
  },
  "status" : 400
}

However, getting the snapshots directly, with GET /_snapshot/repo_name/snap_name, now works fine.

Is it possible that GET /_cat/snapshots was only introduced after Elasticsearch v7, even though it is complaining about a repository?

EDIT: No, a snapshot failed yesterday citing an internal error. I have already tried a rolling cluster restart and a full restart.

Here is the repo part of the elasticsearch.yml file:

path.repo: ["/esdata/nfs/elasticsearch/backups"]
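For reference, the repositories themselves are registered as shared file system repositories pointing under that path, roughly like this (the repository name here is just a placeholder):

    curl -X PUT "localhost:9200/_snapshot/my_repository" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/esdata/nfs/elasticsearch/backups"
      }
    }'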

When I inspect the Docker containers for mounts, here are the paths:
Node 1:

                "Type": "bind",
                "Source": "/esdata/nfs/elasticsearch",
                "Destination": "/esdata/nfs/elasticsearch",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"

Node 2:

                "Type": "bind",
                "Source": "/esdata/nfs/elasticsearch",
                "Destination": "/esdata/nfs/elasticsearch",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"

Node 3:

                "Type": "bind",
                "Source": "/esdata/nfs/elasticsearch",
                "Destination": "/esdata/nfs/elasticsearch",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"

When I verify my repositories using the verify API (POST /_snapshot/my_repository/_verify), it works fine; all nodes show up.

I can also create repositories.
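In other words, something like this works against every repository (placeholder name; the response shape is from memory):

    # A successful verification returns a "nodes" object listing every node
    # that was able to write to the repository.
    curl -X POST "localhost:9200/_snapshot/my_repository/_verify?pretty"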

I just checked the snapshot status; it seems the failure was due to the node shutdown:


       {
          "index" : "index-2022.05.17",
          "index_uuid" : "index-2022.05.17",
          "shard_id" : 3,
          "reason" : "node shutdown",
          "node_id" : "---------------------------",
          "status" : "INTERNAL_SERVER_ERROR"
        }

However, the repository error is still unexplained.

The endpoint GET _cat/snapshots has existed since at least version 6.x.
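If I remember correctly, though, on 7.x the repository name has to be part of the path, and calling the endpoint without one returns exactly that validation error. Something like this should work (placeholder repository name):

    # List the snapshots of a specific repository.
    curl "localhost:9200/_cat/snapshots/my_repository?v"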

What do you have in Elasticsearch logs?

Is the snapshot repository registered and mounted on all three nodes?
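For the mount side, you can check on each host with something like this (paths taken from your config):

    mount | grep /esdata/nfs              # is the NFS export actually mounted?
    df -h /esdata/nfs/elasticsearch       # does it show the new, bigger disk?
    ls /esdata/nfs/elasticsearch/backups  # is the repository content visible?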

Hi, thanks for the reply!

How can I make sure that the snapshot repository is registered and mounted on all nodes? I checked through Kibana that all repositories are registered. I even deleted the repositories and registered them again as shared file system locations, and they were able to find all snapshots. Repository verification also works for all repositories.

I also tried to delete, create, and restore snapshots.

In the Elasticsearch logs, I found the following error:

[2022-05-17T12:16:47,077][ERROR][o.e.x.s.SnapshotLifecycleTask] [node2]failed to create snapshot for snapshot lifecycle policy [index-daily-snapshot]: SnapshotException[[backup-v7-index:index-2022.05.16------/-----------] failed to update snapshot in repository]; nested: ElasticsearchException[failed to create blob container]; nested: FileSystemException[/esdata/nfs/elasticsearch/backups/v7/index/live/indices: Stale file handle];
uncaught exception in thread [main]
java.lang.IllegalStateException: Unable to access 'path.repo' (/esdata/nfs/elasticsearch/backups)
Likely root cause: java.nio.file.AccessDeniedException: /esdata/nfs/elasticsearch/backups
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
For complete error details, refer to the log at /usr/share/elasticsearch/logs/es-cluster.log
uncaught exception in thread [main]

However, when I check the access rights, they are as follows:
For /esdata/nfs:

drwxr-xr-x  6 user user 4096 May 17 12:58 nfs

For /esdata/nfs/elasticsearch:

drwxrwxr-x 3 adminuser adminuser 4096 Aug 10  2020 elasticsearch

For /esdata/nfs/elasticsearch/backups:

drwxrwxr-x 7 adminuser adminuser 4096 Jul 13  2021 backups

I cannot find the log mentioned (/usr/share/elasticsearch/logs/es-cluster.log); it does not seem to have been written.
However, I believe those errors date back to when I was restarting the nodes.
I find it hard to believe it is an access issue, as Elasticsearch can read and write the snapshots.
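If it helps, I can also test write access from inside one of the containers as the Elasticsearch user, something like this (the container name is a placeholder; uid 1000 should be the default user in the official images):

    docker exec -u 1000 es-node1 touch /esdata/nfs/elasticsearch/backups/write-test
    docker exec -u 1000 es-node1 rm /esdata/nfs/elasticsearch/backups/write-test
    docker exec -u 1000 es-node1 ls -la /esdata/nfs/elasticsearch/backups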

Also, SLM does not seem to be complaining; when I run GET _slm/status, it returns RUNNING.
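For completeness, the policy details and SLM stats can also be checked for the last success/failure (policy name taken from the log above):

    curl "localhost:9200/_slm/policy/index-daily-snapshot?human&pretty"
    curl "localhost:9200/_slm/stats?pretty"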

Just a friendly reminder that I still have this issue.

I have never seen this issue and I do not run Docker, so I'm not sure whether it is Docker-related or not, but this log line could give some hint:

FileSystemException[/esdata/nfs/elasticsearch/backups/v7/index/live/indices: Stale file handle]

Since you unmounted and mounted the repository while the cluster was running, this could have caused some issue. Did you have any snapshots running while the mounting/unmounting happened?
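If it is a stale NFS handle, remounting the export on each host and then restarting that node's container (so Elasticsearch reopens its file handles) might clear it. A rough sketch, assuming the export is in /etc/fstab and the container name is a placeholder:

    sudo umount -l /esdata/nfs    # lazily unmount the stale mount
    sudo mount -a                 # remount everything from /etc/fstab
    docker restart es-node1       # let Elasticsearch reopen its file handles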

Is the user running Elasticsearch the same one that owns the paths?

You will need to see if someone from Elastic can give more context about what this means, but just a reminder that there is no SLA on this forum.

Thanks a ton for the reply.

Unfortunately, it is very likely that a snapshot was running when the mounting/unmounting happened.

Yes, it is the same user.

I understand that and I appreciate your help!

Thanks again!
