Snapshot partial failure on cluster

I am running into an issue when taking a snapshot of our cluster. For some odd reason the snapshot isn't fully completing, and it ends up in a PARTIAL state. The node ID listed for every failed index points at node 1 of the cluster, yet I can see the index IDs being created on the NFS mount/file share for every node (including node 1), so I am not sure why this is happening. I verified access on the mount, and Elasticsearch has the necessary permissions to write to it from all nodes in the cluster, but the logs only show that node 1 cannot access the mount. I have attached my findings below; hoping for any kind of help. 🙂

Cluster setup is 2 data nodes, 2 master nodes, and 1 coordinating node (node 1 is a data node).
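For reference, this is a shared-filesystem repository on the NFS mount. A repository like this is normally registered along the following lines (a sketch only; the repository name and location are the ones that appear in the errors below, and the same path has to be listed under path.repo in elasticsearch.yml on every master and data node):

# Sketch: registering a shared-filesystem ("fs") snapshot repository.
curl -X PUT "localhost:9200/_snapshot/elastic_snapshot" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/data/disk3/elastic_snapshot"
  }
}'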

Dev tools output after running snapshot
reason" : "IndexShardSnapshotFailedException[Failed to snapshot]; nested: ElasticsearchException[failed to create blob container]; nested: AccessDeniedException[/data/disk3/elastic_snapshot/indices/j7EykkgiTICux52ziejbFA/0]

Cluster log from node 1
[elastic_snapshot] failed to verify repository
org.elasticsearch.repositories.RepositoryVerificationException: [elastic_snapshot] store location [/data/disk3/elastic_snapshot] is not accessible on the node
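For what it's worth, the same check can be re-run on demand with the repository verify API; each node then reports whether it can write to the shared location:

# Sketch: manually re-run repository verification across the cluster.
curl -X POST "localhost:9200/_snapshot/elastic_snapshot/_verify"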

I suspect that isn't actually the case. A common problem with NFS-based shared filesystem repositories is that access is determined by numeric user ID by default. You might be running Elasticsearch as the same named user on each node, but if those users have different user IDs then they will have inconsistent permissions in the repository.

If you need more help after checking this, please could you share the whole stack trace of the error in the cluster log?

Thanks for your response @DavidTurner. Can you explain how I can check that? This is a new process for me, so I am learning as I go. I'll also work on providing the full stack trace.

Not really, sorry; setting up an NFS shared filesystem is outside my area of expertise.

Here is the full error from the stack trace, @DavidTurner. I am still looking into it on my end as well. The snapshot is showing a PARTIAL state, yet I can see the indices being written in the location specified, so I am really confused about why this is happening. As a test, the permissions on the indices folder were changed with chmod 777. I have also copied the failed shard count below the logs.

[2019-05-09T08:25:36,365][WARN ][o.e.s.SnapshotShardsService] [server.com] [[.security_audit_log-2019.04.21][0]][elastic_snapshot:upgrade_snapshot/lQv50ZWnRrKEuZ1aljgY0A] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to snapshot
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:420) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.snapshots.SnapshotShardsService.access$300(SnapshotShardsService.java:97) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.4.jar:6.5.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
Caused by: org.elasticsearch.ElasticsearchException: failed to create blob container
        at org.elasticsearch.common.blobstore.fs.FsBlobStore.blobContainer(FsBlobStore.java:72) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.<init>(BlobStoreRepository.java:947) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.<init>(BlobStoreRepository.java:940) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.<init>(BlobStoreRepository.java:1168) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:851) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.5.4.jar:6.5.4]
        ... 7 more
Caused by: java.nio.file.AccessDeniedException: /data/disk3/elastic_snapshot/indices/Bm9HgASzRdmGlPR_wbiEBQ/0
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:?]
        at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_191]
        at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_191]
        at java.nio.file.Files.createDirectories(Files.java:767) ~[?:1.8.0_191]
        at org.elasticsearch.common.blobstore.fs.FsBlobStore.buildAndCreate(FsBlobStore.java:89) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.common.blobstore.fs.FsBlobStore.blobContainer(FsBlobStore.java:70) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.<init>(BlobStoreRepository.java:947) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$Context.<init>(BlobStoreRepository.java:940) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.<init>(BlobStoreRepository.java:1168) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:851) ~[elasticsearch-6.5.4.jar:6.5.4]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.5.4.jar:6.5.4]
        ... 7 more                                 

    ],
    "shards" : {
      "total" : 216,
      "failed" : 216,
      "successful" : 0
    }
  }
]
}

One thing to note: I am seeing the ownership of some repository files get overwritten by the kibana user. Not sure if that is a big deal or part of the problem, but it seems worth mentioning as well.

-rw-r--r--.   1 elasticsearch elasticsearch   29 May  9 08:47 index-5
-rw-r--r--.   1 kibana        kibana         11K May  9 08:59 index-6
-rw-r--r--.   1 kibana        kibana           8 May  9 08:59 index.latest
drwxrwxrwx. 111 elasticsearch elasticsearch 4.0K May  9 08:59 indices
-rw-r--r--.   1 kibana        kibana         98K May  9 08:59 meta-ceo4_PhxQF-y-N2uQyFW5w.dat
-rw-r--r--.   1 kibana        kibana         72K May  9 08:59 snap-ceo4_PhxQF-y-N2uQyFW5w.da
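For completeness, here is how to get the same listing with numeric owner IDs, which is what is actually stored on the NFS share rather than the resolved names:

# List with numeric UIDs/GIDs instead of resolved user/group names.
ls -ln /data/disk3/elastic_snapshot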

Yes, this all points towards inconsistent user IDs. I would check the output of id elasticsearch and id kibana on each node; I think you will find a node where the kibana user shares a user ID with another node's elasticsearch user.
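For example, something along these lines would collect that from every node in one go (the host names are placeholders, not your real node names):

# Sketch: compare numeric UIDs/GIDs for both users across all nodes.
# A missing kibana user on a node just prints "no such user", which is fine.
for host in node1 node2 node3 node4 node5; do
  echo "== $host =="
  ssh "$host" 'id elasticsearch; id kibana'
done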

Good call! It looks like the elasticsearch user on some servers shares an ID with the kibana user on another. I just need to figure out how to change them (all new stuff to me); one possible approach is sketched below, after the id output. I will keep you updated on what I do to fix this, but I appreciate you pointing me in the right direction. Thanks again!

DLPV1:
uid=995(elasticsearch) gid=993(elasticsearch) groups=993(elasticsearch)
uid=996(kibana) gid=994(kibana) groups=994(kibana)

DLPA3:
uid=996(elasticsearch) gid=994(elasticsearch) groups=994(elasticsearch)

DLPA5:
uid=996(elasticsearch) gid=994(elasticsearch) groups=994(elasticsearch)

DLPA6:
uid=996(elasticsearch) gid=994(elasticsearch) groups=994(elasticsearch)

DLPA1:
uid=995(elasticsearch) gid=993(elasticsearch) groups=993(elasticsearch)
uid=996(kibana) gid=994(kibana) groups=994(kibana)

DLPA2:
uid=996(elasticsearch) gid=994(elasticsearch) groups=994(elasticsearch)
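For the record, one common way to fix a clash like this is to give the elasticsearch user the same numeric UID/GID on every node and then re-own its files. A rough sketch only; the numbers below match DLPV1/DLPA1 and the paths are examples, not necessarily what I will end up running:

# Sketch: align the elasticsearch user to UID 995 / GID 993 on a node where
# it currently has different numbers (assumes 995/993 are free on that node).
systemctl stop elasticsearch
groupmod -g 993 elasticsearch
usermod -u 995 -g 993 elasticsearch
# Re-own anything still carrying the old numeric IDs (paths are examples).
chown -R elasticsearch:elasticsearch /var/lib/elasticsearch /data/disk3/elastic_snapshot
systemctl start elasticsearch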
