What is actually causing these shard snapshot failures?

And how do I fix them? ;)

I have 7 shards failing with a message of the form: INTERNAL_SERVER_ERROR: NoSuchFileException[/data/elasticsearch/backups/daily/indices/L0OEoJ_DSqOk8aNpntkxqQ/0/index-MD1wjtsmTBuEPMY_zenaaQ]

It is certainly true that the files don't exist, but the message gives no indication of how to fix the issue.

The same 7 shards are failing each day.
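For reference, this is roughly how I'm pulling the per-shard failure details out of the snapshot API (a minimal sketch in Python; the repository name "daily" matches the path above, and the cluster URL is a placeholder):

    # Sketch: list shard-level failures for every snapshot in the "daily" repository.
    # The Elasticsearch URL (and any auth) is a placeholder for the real cluster.
    import requests

    ES = "http://localhost:9200"

    resp = requests.get(f"{ES}/_snapshot/daily/_all", timeout=60)
    resp.raise_for_status()

    for snap in resp.json()["snapshots"]:
        for failure in snap.get("failures", []):
            print(snap["snapshot"], failure["index"], failure["shard_id"], failure["reason"])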

These problems started a while back, I think coinciding with my adding a new data node to the cluster. I did not spot them until I recently attempted to restore an index.

There was a network issue between the new node and one of the other cluster members (a missing firewall rule); fixing that brought the number of failed shards down from around 40 to 7.

The backups are done to a shared disk (mounted via sshfs).
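In case it's useful, the repository itself can still be checked from every node with the verify API; a quick sketch (same placeholder URL as above):

    # Sketch: ask each node to verify that it can write to the "daily" repository.
    import requests

    ES = "http://localhost:9200"

    resp = requests.post(f"{ES}/_snapshot/daily/_verify", timeout=60)
    resp.raise_for_status()
    # Names of the nodes that verified the repository successfully:
    print([node["name"] for node in resp.json()["nodes"].values()])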

Is there any more to the message? A stack trace in the logs perhaps?

Ah! It did not occur to me to look in the server logs...

[2023-09-04T00:30:01,293][WARN ][o.e.s.SnapshotShardsService] [secesprd01] [[winlogbeat-7.16.1-2023.08.15-000067][0]][daily:daily-2023.09.03-pfwfjek1r4imr8qqduj_pq/WJ4laK_iReix2n0GHx2O9Q] failed to snapshot shard
java.nio.file.NoSuchFileException: /data/elasticsearch/backups/daily/indices/X3TaCJ08TTSG8l5-_YQzvA/0/index-EkVzwXdVRLqEbAuWaY5VnA
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261) ~[?:?]
        at java.nio.file.Files.newByteChannel(Files.java:379) ~[?:?]
        at java.nio.file.Files.newByteChannel(Files.java:431) ~[?:?]
        at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]
        at java.nio.file.Files.newInputStream(Files.java:159) ~[?:?]
        at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:210) ~[elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:88) ~[elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3405) ~[elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2655) [elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) [elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$newShardSnapshotTask$2(SnapshotShardsService.java:281) [elasticsearch-7.17.12.jar:7.17.12]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewSnapshots$1(SnapshotShardsService.java:249) [elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.12.jar:7.17.12]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1623) [?:?]

Same problems. What can I do?

Hmm, ok, that stack trace means that Elasticsearch definitely wrote this file at some point in the past. Do you know what version you were running when these problems started? Could it have been 7.14.0 or earlier, which had a known issue related to snapshots?

I have been on 7.17.x since well before the problem started.

I think the problem started when I added a new node to the cluster. I discovered that there were some missing iptables rules, so some of the other cluster members could not communicate with the new node on the 930x ports.

I don't think inter-node connectivity problems could directly explain this, but there are no known issues that would explain it either, and we added a lot more testing in 7.14 to make sure this area was well covered. Elasticsearch thinks it wrote a file, but that file is now missing. I suspect SSHFS could be the culprit, unfortunately; there aren't many folks using it, and I can't find any documentation about the strength of its write-durability guarantees.
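If you want to put the repository's behaviour to the test directly, the repository analysis API (available since 7.12) exercises exactly these write-then-read guarantees. A minimal sketch, with illustrative parameters and a placeholder URL, ideally run while no snapshot is in progress:

    # Sketch: run a small repository analysis against the "daily" repository to
    # check whether the underlying storage honours the guarantees Elasticsearch
    # needs. Blob counts/sizes are illustrative; larger runs are more thorough.
    import requests

    ES = "http://localhost:9200"

    resp = requests.post(
        f"{ES}/_snapshot/daily/_analyze",
        params={"blob_count": 100, "max_blob_size": "10mb", "timeout": "120s"},
        timeout=180,
    )
    # A non-200 response or a reported failure means the storage misbehaved.
    print(resp.status_code)
    print(resp.json())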

In terms of a path forwards, there's no way to repair a repository in this state. It'd be best to start a new snapshot repo for all future snapshots and let the old ones age out.
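Registering a fresh repository is just a new PUT pointing at a path that is also covered by path.repo on every node; a sketch with placeholder names:

    # Sketch: register a new fs repository at a fresh location (placeholder name
    # and path), then point future snapshots / SLM policies at it.
    import requests

    ES = "http://localhost:9200"

    body = {
        "type": "fs",
        "settings": {"location": "/data/elasticsearch/backups/daily-v2"},
    }
    resp = requests.put(f"{ES}/_snapshot/daily-v2", json=body, timeout=60)
    resp.raise_for_status()
    print(resp.json())  # expect {"acknowledged": true}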

A picture, as they say, is worth a thousand words:

Thanks!

