I have 7 shards failing with a message of the form: INTERNAL_SERVER_ERROR: NoSuchFileException[/data/elasticsearch/backups/daily/indices/L0OEoJ_DSqOk8aNpntkxqQ/0/index-MD1wjtsmTBuEPMY_zenaaQ]
It is certainly true that the files don't exist, but the message gives no indication of how to fix the issue.
The same 7 shards are failing each day.
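To see which shards are failing each day, the snapshot APIs report per-shard failures in each snapshot's `failures` array. A minimal sketch, assuming the repository is named `daily` as in the error above (Kibana Dev Tools console format):

```
GET _snapshot/daily/_all?ignore_unavailable=true
```

Each returned snapshot includes a `shards` summary and a `failures` list identifying the affected index and shard number.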
These problems started a while back, I think coincident with me adding a new data node to the cluster. I did not spot the problem until I was attempting to restore an index recently.
There was a network issue between the new node and one of the other cluster members (a missing firewall rule); fixing that brought the number of failed shards down from around 40 to 7.
Ah! It did not occur to me to look in the server logs...
[2023-09-04T00:30:01,293][WARN ][o.e.s.SnapshotShardsService] [secesprd01] [[winlogbeat-7.16.1-2023.08.15-000067][0]][daily:daily-2023.09.03-pfwfjek1r4imr8qqduj_pq/WJ4laK_iReix2n0GHx2O9Q] failed to snapshot shard
java.nio.file.NoSuchFileException: /data/elasticsearch/backups/daily/indices/X3TaCJ08TTSG8l5-_YQzvA/0/index-EkVzwXdVRLqEbAuWaY5VnA
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:261) ~[?:?]
at java.nio.file.Files.newByteChannel(Files.java:379) ~[?:?]
at java.nio.file.Files.newByteChannel(Files.java:431) ~[?:?]
at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:422) ~[?:?]
at java.nio.file.Files.newInputStream(Files.java:159) ~[?:?]
at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:210) ~[elasticsearch-7.17.12.jar:7.17.12]
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:88) ~[elasticsearch-7.17.12.jar:7.17.12]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:3405) ~[elasticsearch-7.17.12.jar:7.17.12]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2655) [elasticsearch-7.17.12.jar:7.17.12]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) [elasticsearch-7.17.12.jar:7.17.12]
at org.elasticsearch.snapshots.SnapshotShardsService.lambda$newShardSnapshotTask$2(SnapshotShardsService.java:281) [elasticsearch-7.17.12.jar:7.17.12]
at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewSnapshots$1(SnapshotShardsService.java:249) [elasticsearch-7.17.12.jar:7.17.12]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.12.jar:7.17.12]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1623) [?:?]
Hmm, ok, that stack trace means that Elasticsearch definitely wrote this file at some point in the past. Do you know what version you were running when these problems started? Could it have been 7.14.0 or earlier, which had a known issue related to snapshots?
I have been on 7.17.x since well before the problem started.
I think the problem started when I added a new node to the cluster. I discovered that there were some missing iptables rules, so some of the other cluster members could not communicate with the new node on 930x.
I don't think inter-node connectivity problems could directly explain this, but there are no known issues that would explain it either, and we added a lot more testing in 7.14 to make sure this area was well covered. Elasticsearch thinks it wrote a file, but the file is now missing. I suspect SSHFS could be the culprit, unfortunately; there aren't many folks using it, and I can't see any docs about the strength of its write-durability guarantees.
In terms of a path forwards, there's no way to repair a repository in this state. It'd be best to start a new snapshot repo for all future snapshots and let the old ones age out.
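Registering a fresh repository can be done without disturbing the old one. A sketch, where the repository name `daily_v2` and the path are illustrative assumptions (the path must be covered by `path.repo` in every node's `elasticsearch.yml`):

```
PUT _snapshot/daily_v2
{
  "type": "fs",
  "settings": {
    "location": "/data/elasticsearch/backups/daily_v2"
  }
}
```

If you use an SLM policy for the daily snapshots, update its `repository` field to point at the new repository; the old repository can stay registered read-only until its snapshots are no longer needed.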