Snapshot repository missing after remounting network disk

Hi All,

We decided to increase the amount of storage we have for snapshots on a three-node Elasticsearch v7 cluster.

We did that by copying the contents of the old disk to the bigger disk. When the copy was done, we unmounted the old disk and mounted the new one while the cluster was running. The NFS share got unmounted in the process, so we had to remount it.

We have the shared disk on one of the nodes and share the repository with the other nodes using NFS.

Now, when I request

GET /_cat/snapshots

I get the following error

{
  "error" : {
    "root_cause" : [
      {
        "type" : "action_request_validation_exception",
        "reason" : "Validation Failed: 1: repository is missing;"
      }
    ],
    "type" : "action_request_validation_exception",
    "reason" : "Validation Failed: 1: repository is missing;"
  },
  "status" : 400
}

I can now access the storage on all three servers just fine. However, I have not restarted the servers yet.
Any idea how to solve this?

You need to restart every node. Since you have three nodes, you can do a rolling restart to avoid cluster downtime.
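A rough sketch of the usual sequence, one node at a time, assuming the nodes run as Docker containers reachable on localhost:9200 (the container name is a placeholder):

    # Stop shard reallocation so the cluster does not rebalance while the node is down.
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    { "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

    # Restart the node's container and wait until it shows up again.
    docker restart es-node1
    curl "localhost:9200/_cat/nodes?v"

    # Re-enable allocation and wait for green before moving to the next node.
    curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    { "persistent": { "cluster.routing.allocation.enable": null } }'
    curl "localhost:9200/_cat/health?v"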

Hi, thanks for your reply!

I restarted the cluster. However, GET /_cat/snapshots still gives the same response:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "action_request_validation_exception",
        "reason" : "Validation Failed: 1: repository is missing;"
      }
    ],
    "type" : "action_request_validation_exception",
    "reason" : "Validation Failed: 1: repository is missing;"
  },
  "status" : 400
}

However, getting the snapshots directly, with GET /_snapshot/repo_name/snap_name, now works fine.

Is it possible that GET /_cat/snapshots was only introduced after Elasticsearch v7, even though it is complaining about a repository?

EDIT: No, a snapshot failed yesterday citing an internal error. I have already tried a rolling cluster restart and a full restart.

Here is the repo part of the elasticsearch.yml file:

path.repo: ["/esdata/nfs/elasticsearch/backups"]
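For reference, the repositories themselves are registered as shared file system repositories pointing under that path, roughly like this (the repository name here is just a placeholder):

    curl -X PUT "localhost:9200/_snapshot/my_repository" -H 'Content-Type: application/json' -d'
    {
      "type": "fs",
      "settings": {
        "location": "/esdata/nfs/elasticsearch/backups"
      }
    }'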

When I inspect the Docker containers for mounts, here are the paths:
Node 1:

                "Type": "bind",
                "Source": "/esdata/nfs/elasticsearch",
                "Destination": "/esdata/nfs/elasticsearch",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"

Node 2:

                "Type": "bind",
                "Source": "/esdata/nfs/elasticsearch",
                "Destination": "/esdata/nfs/elasticsearch",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"

Node 3:

                "Type": "bind",
                "Source": "/esdata/nfs/elasticsearch",
                "Destination": "/esdata/nfs/elasticsearch",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"

When I verify my repositories using the verify API (POST /_snapshot/my_repository/_verify), it works fine; all nodes show up.

I can also create repositories.
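In other words, something like this works against every repository (placeholder name; the response shape is from memory):

    # A successful verification returns a "nodes" object listing every node
    # that was able to write to the repository.
    curl -X POST "localhost:9200/_snapshot/my_repository/_verify?pretty"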

I just checked the snapshot status; it seems the failure was due to the node shutdown:


       {
          "index" : "index-2022.05.17",
          "index_uuid" : "index-2022.05.17",
          "shard_id" : 3,
          "reason" : "node shutdown",
          "node_id" : "---------------------------",
          "status" : "INTERNAL_SERVER_ERROR"
        }

However, the repository error is still unexplained.

The endpoint GET _cat/snapshots has existed since at least version 6.x.
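If I remember correctly, though, on 7.x the repository name has to be part of the path, and calling the endpoint without one returns exactly that validation error. Something like this should work (placeholder repository name):

    # List the snapshots of a specific repository.
    curl "localhost:9200/_cat/snapshots/my_repository?v"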

What do you have in Elasticsearch logs?

Is the snapshot repository registered and mounted on all three nodes?
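For the mount side, you can check on each host with something like this (paths taken from your config):

    mount | grep /esdata/nfs              # is the NFS export actually mounted?
    df -h /esdata/nfs/elasticsearch       # does it show the new, bigger disk?
    ls /esdata/nfs/elasticsearch/backups  # is the repository content visible?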

Hi, thanks for the reply!

How can I make sure that the snapshot repository is registered and mounted on all nodes? I checked through Kibana that all repositories are registered. I even deleted the repositories and registered them again as shared file system locations, and they were able to find all snapshots. Repository verification also works for all repositories.

I also tried to delete, create, and restore snapshots.

In the Elasticsearch logs, I found the following error:

[2022-05-17T12:16:47,077][ERROR][o.e.x.s.SnapshotLifecycleTask] [node2]failed to create snapshot for snapshot lifecycle policy [index-daily-snapshot]: SnapshotException[[backup-v7-index:index-2022.05.16------/-----------] failed to update snapshot in repository]; nested: ElasticsearchException[failed to create blob container]; nested: FileSystemException[/esdata/nfs/elasticsearch/backups/v7/index/live/indices: Stale file handle];
uncaught exception in thread [main]
java.lang.IllegalStateException: Unable to access 'path.repo' (/esdata/nfs/elasticsearch/backups)
Likely root cause: java.nio.file.AccessDeniedException: /esdata/nfs/elasticsearch/backups
        at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:90)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
For complete error details, refer to the log at /usr/share/elasticsearch/logs/es-cluster.log
uncaught exception in thread [main]

However, when I check the access rights, they are as follows:
For /esdata/nfs:

drwxr-xr-x  6 user user 4096 May 17 12:58 nfs

For /esdata/nfs/elasticsearch:

drwxrwxr-x 3 adminuser adminuser 4096 Aug 10  2020 elasticsearch

For /esdata/nfs/elasticsearch/backups:

drwxrwxr-x 7 adminuser adminuser 4096 Jul 13  2021 backups

I cannot find the log mentioned (/usr/share/elasticsearch/logs/es-cluster.log); it does not seem to have been written.
However, I believe those errors date back to when I was restarting the nodes.
I find it hard to believe it is an access issue, as Elasticsearch can read and write the snapshots.
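If it helps, I can also test write access from inside one of the containers as the Elasticsearch user, something like this (the container name is a placeholder; uid 1000 should be the default user in the official images):

    docker exec -u 1000 es-node1 touch /esdata/nfs/elasticsearch/backups/write-test
    docker exec -u 1000 es-node1 rm /esdata/nfs/elasticsearch/backups/write-test
    docker exec -u 1000 es-node1 ls -la /esdata/nfs/elasticsearch/backups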

Also, SLM does not seem to be complaining; when I run GET _slm/status, it returns RUNNING.
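For completeness, the policy details and SLM stats can also be checked for the last success/failure (policy name taken from the log above):

    curl "localhost:9200/_slm/policy/index-daily-snapshot?human&pretty"
    curl "localhost:9200/_slm/stats?pretty"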

Just a friendly reminder that I still have this issue.

I have never seen this issue and I do not run Docker, so I'm not sure whether it is Docker-related or not, but this log line could give some hint:

FileSystemException[/esdata/nfs/elasticsearch/backups/v7/index/live/indices: Stale file handle]

Since you unmounted and mounted the repository while the cluster was running, this could have caused some issue. Did you have any snapshots running while the mounting/unmounting happened?
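If it is a stale NFS handle, remounting the export on each host and then restarting that node's container (so Elasticsearch reopens its file handles) might clear it. A rough sketch, assuming the export is in /etc/fstab and the container name is a placeholder:

    sudo umount -l /esdata/nfs    # lazily unmount the stale mount
    sudo mount -a                 # remount everything from /etc/fstab
    docker restart es-node1       # let Elasticsearch reopen its file handles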

Is the user running Elasticsearch the same one that owns the paths?

You will need to see if someone from Elastic can give more context about what this means, but just a reminder that there is no SLA on this forum.

Thanks a ton for the reply.

Unfortunately, it is very likely that a snapshot was running when the mounting/unmounting happened.

Yes, it is the same user.

I understand that and I appreciate your help!

Thanks again!
