Snapshot Recovery fails due to connection issues with GCS repository

Hello,

I've been running into connection issues while trying to restore a snapshot from GCS repository. We have multiple clusters and other ES clusters can connect to a GCS bucket within same project on google cloud but with this specific cluster we've been seeing these connection issues since past few days every time we retry the restore.

The restore has 30+ indices and the connection issue is specific to 4-5 indices which are in red status. Some of the indices are around 10TB in size and have some data already restored.

Below is the stack trace(trimmed for char limit) from the node for one of the index - all of them are failing with same error:

    {"type": "ElasticSearch", "timestamp": "2021-02-18T18:14:42,070Z", "level": "WARN", "component": "o.e.i.c.IndicesClusterStateService", "cluster.name": "escluster", "node.name": "escluster--elasticsearch-data-1", "message": "[orders.9][161] marking and sending shard failed due to [failed recovery]", "cluster.uuid": "al83ZmR6TguwcufYK9UQsw", "node.id": "rUxzn2XbRl-aJWD4okWNBw" ,
"stacktrace": ["org.elasticsearch.indices.recovery.RecoveryFailedException: [orders.9][161]: Recovery failed on {escluster--elasticsearch-data-1}{rUxzn2XbRl-aJWD4okWNBw}{6p-7AqEKRNuMTZgHgt1bIg}{10.77.1.6}{10.77.1.6:9300}{dilm}{ml.machine_memory=68719476736, xpack.installed=true, zone=us-central1-c, ml.max_open_jobs=20}",
"at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$17(IndexShard.java:2584) [elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.2.jar:7.5.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery",
"at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1867) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$17(IndexShard.java:2580) ~[elasticsearch-7.5.2.jar:7.5.2]",
"... 4 more",
"Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed",
"at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:480) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1867) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$17(IndexShard.java:2580) ~[elasticsearch-7.5.2.jar:7.5.2]",
"... 4 more",
"Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [2021-02-18t12-39-32/B8ySkbdGTUuvLfo91W-06Q]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:1255) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.repositories.Repository.restoreShard(Repository.java:231) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:469) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1867) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$17(IndexShard.java:2580) ~[elasticsearch-7.5.2.jar:7.5.2]",
"... 4 more",
"Caused by: com.google.cloud.storage.StorageException: Connection reset",
"at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:227) ~[?:?]",
"at com.google.cloud.storage.spi.v1.HttpStorageRpc.read(HttpStorageRpc.java:690) ~[?:?]",
"at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:127) ~[?:?]",
"at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:124) ~[?:?]",
"at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105) ~[?:?]",
"at com.google.cloud.RetryHelper.run(RetryHelper.java:76) ~[?:?]",
"at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50) ~[?:?]",
"at com.google.cloud.storage.BlobReadChannel.read(BlobReadChannel.java:123) ~[?:?]",
"at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.lambda$read$0(GoogleCloudStorageBlobStore.java:189) ~[?:?]",
"at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]",
"at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:44) ~[?:?]",
"at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.read(GoogleCloudStorageBlobStore.java:189) ~[?:?]",
"at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]",
"at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]",
"at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]",
"at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92) ~[elasticsearch-7.5.2.jar:7.5.2]",
"at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]",
"at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:69) ~[elasticsearch-7.5.2.jar:7.5.2]",
"... 4 more",
"Caused by: javax.net.ssl.SSLException: Connection reset",
"at sun.security.ssl.Alert.createSSLException(Alert.java:127) ~[?:?]",
"at sun.security.ssl.TransportContext.fatal(TransportContext.java:324) ~[?:?]",
"at sun.security.ssl.TransportContext.fatal(TransportContext.java:267) ~[?:?]",
"at sun.security.ssl.TransportContext.fatal(TransportContext.java:262) ~[?:?]",
"at sun.security.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1652) ~[?:?]",
"at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1038) ~[?:?]",
"at java.io.BufferedInputStream.fill(BufferedInputStream.java:245) ~[?:?]",
"at java.io.BufferedInputStream.read1(BufferedInputStream.java:285) ~[?:?]",
"at java.io.BufferedInputStream.read(BufferedInputStream.java:344) ~[?:?]",
"at sun.net.www.MeteredStream.read(MeteredStream.java:134) ~[?:?]",
"at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]",
"at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3505) ~[?:?]",
"at com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:164) ~[?:?]",
"at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]",
"at com.google.api.client.util.ByteStreams.copy(ByteStreams.java:49) ~[?:?]",
"at com.google.api.client.util.IOUtils.copy(IOUtils.java:87) ~[?:?]",
"at com.google.api.client.util.IOUtils.copy(IOUtils.java:59) ~[?:?]",
"at com.google.api.client.http.HttpResponse.download(HttpResponse.java:386) ~[?:?]",

"... 4 more",
"Suppressed: java.net.SocketException: Broken pipe",
"\tat sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:421) ~[?:?]",
"\tat sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:441) ~[?:?]",
"\tat sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:825) ~[?:?]",
"\tat java.net.Socket$SocketOutputStream.write(Socket.java:1007) ~[?:?]",
"\tat sun.security.ssl.SSLSocketOutputRecord.encodeAlert(SSLSocketOutputRecord.java:82) ~[?:?]",
"\tat sun.security.ssl.TransportContext.fatal(TransportContext.java:355) ~[?:?]",
"\tat sun.security.ssl.TransportContext.fatal(TransportContext.java:267) ~[?:?]",
"\tat sun.security.ssl.TransportContext.fatal(TransportContext.java:262) ~[?:?]",
"\tat sun.security.ssl.SSLSocketImpl.handleException(SSLSocketImpl.java:1652) ~[?:?]",
"\tat sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1038) ~[?:?]",
"\tat java.io.BufferedInputStream.fill(BufferedInputStream.java:245) ~[?:?]",
"\tat java.io.BufferedInputStream.read1(BufferedInputStream.java:285) ~[?:?]",
"\tat java.io.BufferedInputStream.read(BufferedInputStream.java:344) ~[?:?]",
"\tat sun.net.www.MeteredStream.read(MeteredStream.java:134) ~[?:?]",
"\tat java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]",
"\tat sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3505) ~[?:?]"

"\tat java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.net.SocketException: Connection reset",
"at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:324) ~[?:?]",
"at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:351) ~[?:?]",
"at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:802) ~[?:?]",
"at java.net.Socket$SocketInputStream.read(Socket.java:937) ~[?:?]",
"at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:450) ~[?:?]",
"at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68) ~[?:?]",
"at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1409) ~[?:?]",
"at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1022) ~[?:?]",
"at java.io.BufferedInputStream.fill(BufferedInputStream.java:245) ~[?:?]",
"at java.io.BufferedInputStream.read1(BufferedInputStream.java:285) ~[?:?]",
"at java.io.BufferedInputStream.read(BufferedInputStream.java:344) ~[?:?]",
"at sun.net.www.MeteredStream.read(MeteredStream.java:134) ~[?:?]",
"at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]",
"at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3505) ~[?:?]",
"at com.google.api.client.http.javanet.NetHttpResponse$SizeValidatingInputStream.read(NetHttpResponse.java:164) ~[?:?]",
"at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]",

"... 4 more"] }

Trimmed stack trace due to character limit. Full stack trace pasted here: https://pastebin.com/sQFf3kLS

Apparently, GCS SDK does not resume downloads that fail part-way through:

The newer ES versions have introduced a retry logic :

1 Like