What is actually causing these shard snapshot failures?

Russell_Fulton · September 3, 2023, 7:50pm

And how do I fix them? ;)

I have 7 shards failing with a message of the form: INTERNAL_SERVER_ERROR: NoSuchFileException[/data/elasticsearch/backups/daily/indices/L0OEoJ_DSqOk8aNpntkxqQ/0/index-MD1wjtsmTBuEPMY_zenaaQ]

It is certainly true that the files do not exist, but the message gives no indication of how to fix the issue.

The same 7 shards are failing each day.

These problems started a while back, I think coincident with me adding a new data node to the cluster. I did not spot the problem until I was attempting to restore an index recently.

There was a network issue between the new node and one of the other cluster members (a missing firewall rule); fixing that brought the number of failed shards down from around 40 to 7.

The backups are done to a shared disk (via sshfs).

DavidTurner · September 3, 2023, 8:12pm

Is there any more to the message? A stack trace in the logs perhaps?

Russell_Fulton · September 3, 2023, 8:40pm

Ah! It did not occur to me to look in the server logs...

[2023-09-04T00:30:01,293][WARN ][o.e.s.SnapshotShardsService] [secesprd01] [[winlogbeat-7.16.1-2023.08.15-000067][0]][daily:daily-2023.09.03-pfwfjek1r4imr8qqduj_pq/WJ4laK_iReix2n0GHx2O9Q] failed to snapshot shard
java.nio.file.NoSuchFileException: /data/elasticsearch/backups/daily/indices/X3TaCJ08TTSG8l5-_YQzvA/0/index-EkVzwXdVRLqEbAuWaY5VnA
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261) ~[?:?]
        at java.nio.file.Files.newByteChannel(Files.java:379) ~[?:?]
        at java.nio.file.Files.newByteChannel(Files.java:431) ~[?:?]
        at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]
        at java.nio.file.Files.newInputStream(Files.java:159) ~[?:?]
        at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:210) ~[elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:88) ~[elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3405) ~[elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2655) [elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) [elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$newShardSnapshotTask$2(SnapshotShardsService.java:281) [elasticsearch-7.17.12.jar:7.17.12]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewSnapshots$1(SnapshotShardsService.java:249) [elasticsearch-7.17.12.jar:7.17.12]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.12.jar:7.17.12]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.lang.Thread.run(Thread.java:1623) [?:?]
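For anyone triaging the same warning, here is a minimal sketch of how the missing blob paths could be pulled out of a server log and mapped back to a repository directory and shard number. It assumes the log format shown above and the `indices/<directory>/<shard>/<blob>` layout visible in the paths; the helper names `missing_blob_paths` and `shard_of` are made up for illustration and are not part of Elasticsearch.

```python
import re
from pathlib import Path

# Matches the exception line that follows the "failed to snapshot shard" warning.
MISSING_BLOB = re.compile(r"java\.nio\.file\.NoSuchFileException: (?P<path>/\S+)")


def missing_blob_paths(log_text):
    """Return the distinct blob paths named in NoSuchFileException lines, in first-seen order."""
    seen = []
    for match in MISSING_BLOB.finditer(log_text):
        path = match.group("path")
        if path not in seen:
            seen.append(path)
    return seen


def shard_of(blob_path):
    """Split .../indices/<directory>/<shard>/<blob> into (directory name, shard number)."""
    parts = Path(blob_path).parts
    i = parts.index("indices")
    return parts[i + 1], int(parts[i + 2])


if __name__ == "__main__":
    # Sample line copied from the stack trace in this thread.
    sample = (
        "java.nio.file.NoSuchFileException: /data/elasticsearch/backups/daily/"
        "indices/X3TaCJ08TTSG8l5-_YQzvA/0/index-EkVzwXdVRLqEbAuWaY5VnA\n"
    )
    for blob in missing_blob_paths(sample):
        # Path.exists() only gives a meaningful answer on a host that mounts the repository.
        print(shard_of(blob), "exists on this host:", Path(blob).exists())
```

Run against the full log, this gives a deduplicated list of affected shard directories, which makes it easy to confirm whether the same shards fail every day.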

Carl_Newkirk · September 4, 2023, 5:56am

Same problem here.
What can I do?

DavidTurner · September 4, 2023, 6:35am

Hmm ok, that stack trace means that Elasticsearch definitely wrote this file at some point in the past. Do you know what version you were running when these problems started? Could it have been 7.14.0 or earlier, which had a known issue related to snapshots?

Russell_Fulton · September 5, 2023, 1:02am

I had been on 7.17.x since well before the problem started.

I think the problem started when I added a new node to the cluster. I discovered that there were some missing iptables rules so some of the other cluster members could not communicate with the new node on 930x.

DavidTurner · September 5, 2023, 6:39am

I don't think inter-node connectivity problems could directly explain this, but there are no known issues that would explain it either, and we added a lot more testing in 7.14 to make sure this area was well-covered. Elasticsearch thinks it wrote a file but the file is now missing. Unfortunately, I suspect SSHFS could be the culprit: there aren't many folks using it, and I can't see any docs about the strength of its write durability guarantees.

In terms of a path forwards, there's no way to repair a repository in this state. It'd be best to start a new snapshot repo for all future snapshots and let the old ones age out.
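As a rough sketch of what starting a new repository could look like, the snippet below builds the request for the snapshot repository API (`PUT /_snapshot/<name>` with `"type": "fs"`). The repository name `daily-v2`, the location and the `http://localhost:9200` URL are placeholders, not values from this thread, and the location would need to be listed under `path.repo` on every master and data node before the request is accepted.

```python
import json
import urllib.request


def register_fs_repo_request(repo_name, location):
    """Build (method, path, body) for registering a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return "PUT", f"/_snapshot/{repo_name}", body


def send(base_url, request):
    """Send a built request to the cluster and return the decoded JSON reply.

    Assumes an unauthenticated HTTP endpoint; add credentials/TLS as needed.
    """
    method, path, body = request
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder name and path; nothing is sent unless send() is called.
    request = register_fs_repo_request("daily-v2", "/data/elasticsearch/backups/daily-v2")
    print(request[0], request[1])
    print(json.dumps(request[2], indent=2))
    # send("http://localhost:9200", request)
```

Whatever schedules the daily snapshots would then need to point at the new repository, while the old one is kept only until its snapshots age out.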

Russell_Fulton · September 6, 2023, 9:35pm

A picture, as they say, is worth a thousand words:

Thanks!

system · October 4, 2023, 9:36pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
