Failed to write shard level snapshot

Dear all, I hope you can help me understand an issue we are having when taking snapshots.
We have a cluster (Elasticsearch version 7.10.1) of 4 nodes (1 ingest and 3 data/master) running as services on Windows Server machines,
with 61 indices, each with 5 shards and 1 replica.
We use an Azure repository to store our snapshots; we take one every hour and keep only the last 5.
This process worked properly for a long time, but we are now getting the following error:

"data_streams": [],
  "include_global_state": true,
  "state": "PARTIAL",
  "start_time": "2021-09-20T08:23:00.980Z",
  "start_time_in_millis": 1632126180980,
  "end_time": "2021-09-21T01:31:14.882Z",
  "end_time_in_millis": 1632187874882,
  "duration_in_millis": 61693902,
  "failures": [
	{
	  "index": "itemreaddetailactivities_all",
	  "index_uuid": "itemreaddetailactivities_all",
	  "shard_id": 4,
	  "reason": "IndexShardSnapshotFailedException[Failed to write shard level snapshot metadata for 
	  [prod_bak_202109200823010923/tg-_OeaqQBWIJQo58rVeZA] to [index-ddeEpGCtSAqGXhhU4ifX-A]]; 
	  nested: IOException[Can not write blob index-ddeEpGCtSAqGXhhU4ifX-A]; nested: StorageException[]; 
	  nested: UnknownHostException[xyz.blob.core.windows.net]",
	  "node_id": "tBawpgWfSo-IvqwOASjjcQ",
	  "status": "INTERNAL_SERVER_ERROR"
	}

The excerpt above shows only one failure record, but there is one for each index, with the same reason and different shard IDs.

"shards": {
	"total": 305,
	"failed": 111,
	"successful": 194
}
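
To see whether the failed shards cluster on a single node, the `failures` array can be tallied by `node_id`. This is only a minimal sketch: it assumes the snapshot entry has already been parsed from the JSON response into a Python dict, and the function name `failures_by_node` and the second node ID in the sample are made up for illustration.

```python
from collections import Counter


def failures_by_node(snapshot):
    """Count failed shards per node_id in one parsed snapshot entry."""
    failures = snapshot.get("failures", [])
    return Counter(f.get("node_id", "unknown") for f in failures)


# Illustrative sample only: the first node ID comes from the record above,
# "other-node-id" is a hypothetical placeholder.
sample = {
    "failures": [
        {"index": "itemreaddetailactivities_all", "shard_id": 4,
         "node_id": "tBawpgWfSo-IvqwOASjjcQ"},
        {"index": "items_2016", "shard_id": 3,
         "node_id": "tBawpgWfSo-IvqwOASjjcQ"},
        {"index": "items_2016", "shard_id": 1,
         "node_id": "other-node-id"},
    ]
}

# List nodes from most to fewest failed shards.
for node_id, count in failures_by_node(sample).most_common():
    print(node_id, count)
```

If nearly all failures map to one `node_id`, that points at a problem local to that machine rather than at the repository itself.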

The log file contains the following exception details:

[2021-09-21T00:08:10,104][WARN ][o.e.s.SnapshotShardsService] [data_node_01] [[items_2016][3]][my_backup_azure_production:prod_bak_202109200823010923/tg-_OeaqQBWIJQo58rVeZA] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to write shard level snapshot metadata for [prod_bak_202109200823010923/tg-_OeaqQBWIJQo58rVeZA] to [index-fIce999wRSSpG_Rp6_ccdQ]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2009) [elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:344) [elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:260) [elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) [elasticsearch-7.10.1.jar:7.10.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: java.io.IOException: Can not write blob index-fIce999wRSSpG_Rp6_ccdQ
	at org.elasticsearch.repositories.azure.AzureBlobContainer.writeBlob(AzureBlobContainer.java:117) ~[?:?]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:146) ~[elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2005) ~[elasticsearch-7.10.1.jar:7.10.1]
	... 6 more
Caused by: com.microsoft.azure.storage.StorageException: 
	at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:87) ~[?:?]
	at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:220) ~[?:?]
	at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadFullBlob(CloudBlockBlob.java:1035) ~[?:?]
	at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:864) ~[?:?]
	at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:743) ~[?:?]
	at org.elasticsearch.repositories.azure.AzureBlobStore.lambda$writeBlob$18(AzureBlobStore.java:339) ~[?:?]
	at org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:69) ~[?:?]
	at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]
	at org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:68) ~[?:?]
	at org.elasticsearch.repositories.azure.AzureBlobStore.writeBlob(AzureBlobStore.java:338) ~[?:?]
	at org.elasticsearch.repositories.azure.AzureBlobContainer.writeBlob(AzureBlobContainer.java:115) ~[?:?]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:146) ~[elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2005) ~[elasticsearch-7.10.1.jar:7.10.1]
	... 6 more
Caused by: java.net.UnknownHostException: transferfileforelastic.blob.core.windows.net
	at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:567) ~[?:?]
	at java.net.Socket.connect(Socket.java:648) ~[?:?]
	at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:265) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:372) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:189) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1194) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1082) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:175) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1375) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1350) ~[?:?]
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getOutputStream(HttpsURLConnectionImpl.java:220) ~[?:?]
	at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:100) ~[?:?]
	at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadFullBlob(CloudBlockBlob.java:1035) ~[?:?]
	at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:864) ~[?:?]
	at com.microsoft.azure.storage.blob.CloudBlockBlob.upload(CloudBlockBlob.java:743) ~[?:?]
	at org.elasticsearch.repositories.azure.AzureBlobStore.lambda$writeBlob$18(AzureBlobStore.java:339) ~[?:?]
	at org.elasticsearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(SocketAccess.java:69) ~[?:?]
	at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]
	at org.elasticsearch.repositories.azure.SocketAccess.doPrivilegedVoidException(SocketAccess.java:68) ~[?:?]
	at org.elasticsearch.repositories.azure.AzureBlobStore.writeBlob(AzureBlobStore.java:338) ~[?:?]
	at org.elasticsearch.repositories.azure.AzureBlobContainer.writeBlob(AzureBlobContainer.java:115) ~[?:?]
	at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.write(ChecksumBlobStoreFormat.java:146) ~[elasticsearch-7.10.1.jar:7.10.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:2005) ~[elasticsearch-7.10.1.jar:7.10.1]
	... 6 more

If you need more details, please ask. Thanks.

Are these warning logs (java.net.UnknownHostException) all happening on one of the machines, e.g. data_node_01? It looks like one of the machines has trouble connecting to the Azure blob service (DNS resolution failure?). Can you restart the nodes to see if that helps?
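
One way to test the DNS theory is to run a small resolution check on each machine and compare the results. The following is a rough sketch, not an Elasticsearch tool: `check_host` is a hypothetical helper, and the `resolver` parameter exists only so the lookup can be swapped out for testing. It uses the operating system resolver through Python, so it will not reflect any DNS caching done inside the Elasticsearch JVM.

```python
import socket


def check_host(hostname, port=443, resolver=socket.getaddrinfo):
    """Try to resolve hostname.

    Returns (True, sorted list of addresses) on success,
    or (False, error message) if resolution fails.
    """
    try:
        infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return False, str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in infos})
    return True, addresses
```

Calling `check_host("xyz.blob.core.windows.net")` on every node, with the real storage account host in place of the anonymized name, should show whether only the node that logs the warnings fails to resolve it.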

Yes, I see those warnings only on one machine.

I already restarted all the Elasticsearch nodes, but that didn't fix the issue. I'm now going to restart all 4 machines and see if that fixes it.
