Snapshot to GCS repository times out

Hello,

We never had problems backing up our data to filesystem (type: fs) repositories, but since migrating to a GCS repository, snapshots systematically time out, with 1 to 3 shards out of 64 not being snapshotted correctly after 20-35 minutes of running.

Reasons are socket timeouts:

"reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: SocketTimeoutException[Read timed out]; "

or 503 Service Unavailable:

"reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: GoogleJsonResponseException[503 Service Unavailable\nService Unavailable]; "

all related to GCS rather than Elasticsearch directly, I presume.

Our ES version is 5.2.2. We back up data from an ES cluster running on GCE instances to GCS. Index size is about 200GB. Is anyone else encountering similar issues? What can be done about them?

Thank you,

David

Hello David,

It appears to be a connectivity issue with GCS. Are the shards that are failing always on the same node? You will have to see on which node the primary shard resides for each shard, as snapshotting takes place from the primary shard, never the replica.
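For what it's worth, a quick way to check is the cat shards API (assuming you can reach a node on the default port 9200); the prirep column shows "p" for the primary copy and the node column shows where it is allocated:

# list every shard, its primary/replica flag and the node it lives on
curl -XGET 'localhost:9200/_cat/shards?v'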

Can you try using a script or GCS CLI tool to upload the snapshotted files and see if there are any issues executing the file uploads to GCS outside of Elasticsearch?
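For example, something like the following (bucket name and file path are placeholders) would exercise the same network path Elasticsearch uses:

# time the upload of a reasonably large local file to the repository bucket
time gsutil cp /path/to/a/large/local/file gs://your-backup-bucket/upload-test/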

Such issues often relate back to the environment. In your case, since you're getting 503 Service Unavailable, it seems to be a connectivity issue.

Is there a full stack trace from the error that you can share?

Hello Ali,

Thank you for the reply.

Are the shards that are failing always on the same node?

No, all nodes are affected.

Can you try using a script or GCS CLI tool to upload the snapshotted files and see if there are any issues executing the file uploads to GCS outside of Elasticsearch?

This works. Specifically, before using the gcs repository we would first create an fs-type backup to a repository on another GCE instance (i.e. not on the cluster nodes) and push this backup to GCS via gsutil.

Is there a full stack trace from the error that you can share?

All I have is the logs I posted in the github issue. I could try and lower the log level for the repository service.

it seems to be a connectivity issue.

The status does seem to indicate it; however, all nodes are on GCE, so conditions should be close to optimal, and as I said we have not encountered issues when pushing larger files via gsutil.

Do you notice this issue happening at the start of a snapshotting process or more towards the end?

There is nothing in the logs with GoogleCloudStorageBlobContainer in the stack trace, or anything similar? It should appear right where the stack trace you posted on the GitHub issue is. Strange...
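One way to see how far a snapshot gets before a shard fails is to poll the snapshot status API while it runs; a sketch, with example repository and snapshot names:

# per-shard progress (stage, files and bytes done vs. total) of a running snapshot
curl -XGET 'localhost:9200/_snapshot/gcs_repository/production-index-20170428/_status?pretty'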

Hi,

I turned org.elasticsearch.snapshots.SnapshotShardsService logging to DEBUG to provide you more information but for some reason the snapshots go through now so I can't reproduce. I will add the information here once we encounter the issue again.
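In case it is useful to someone else, this can be changed dynamically through the cluster settings API, no restart needed (a sketch):

PUT /_cluster/settings
{
  "transient" : {
    "logger.org.elasticsearch.snapshots.SnapshotShardsService" : "DEBUG"
  }
}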

Thanks.

David

Sounds good

Hello, over the past 9 days the snapshot has failed 5 times. The debug-level logs do not reveal new information; I include an example with 2 out of 63 shards failing, see pastebin. As I mentioned before, ES is running on GCE instances, so one can hardly do better as far as connectivity to GCS is concerned.

Any advice?

Regards,

David

Hi David,

Are you using some kind of network volume on your GCE instances? The exceptions indicate there is a read timeout, meaning reading from the ES (Lucene) index in order to write the snapshot to GCS. I'm wondering if you have an EBS-like volume on your GCE instances?

I'm surprised a full stack trace hasn't appeared on the console or system logs. Do you have access to those?

My bad - wrong grep. Here is the missing stack trace for one of the 2 shards that failed to snapshot correctly; the logs for the other shard are similar:

Apr 28 03:28:36 es-host-002 elasticsearch[es-host-002][snapshot][T#2] WARN SnapshotShardsService - [[index_crm_0][22]] [gcs_repository:production-index-20170428/23D3ndo2TjuvJn-izifURA] failed to create snapshot [index_crm_0/mij0Os3cQHK936GRI32VyQ][[index_crm_0][22]] IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: SocketTimeoutException[Read timed out];
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1366)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:961)
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382)
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88)
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:704)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1569)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.api.client.googleapis.media.MediaHttpUploader.executeCurrentRequestWithoutGZip(MediaHttpUploader.java:545)
at com.google.api.client.googleapis.media.MediaHttpUploader.resumableUpload(MediaHttpUploader.java:417)
at com.google.api.client.googleapis.media.MediaHttpUploader.upload(MediaHttpUploader.java:336)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:427)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobStore.lambda$writeBlob$5(GoogleCloudStorageBlobStore.java:219)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobStore.doPrivileged(GoogleCloudStorageBlobStore.java:333)
at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobStore.writeBlob(GoogleCloudStorageBlobStore.java:213)
at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobContainer.writeBlob(GoogleCloudStorageBlobContainer.java:72)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1422)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1364)
... 9 more

Yes, this stack trace indicates that the socket is timing out while waiting for a response from Google Cloud Storage during the upload. A similar issue is reported here: java - SocketTimeoutException: Read timed out on Insert to Google Drive - is it retryable? - Stack Overflow

I'm wondering if the shard in question (index index_crm_0, shard 22) is a particularly large shard? You can check its size by executing:

curl -XGET localhost:9200/_cat/shards

If it's very large, I'm guessing it may be timing out. In that case, you can try increasing these values (the default is 20s) by setting http.read_timeout and http.connect_timeout in the repository's settings: Google Cloud Storage repository plugin | Elasticsearch Plugins and Integrations [master] | Elastic. For example:

PUT _snapshot/my_gcs_repository
{
  "type": "gcs",
  "settings": {
    "bucket": "my_bucket",
    "service_account": "service_account.json",
    "http.read_timeout": "1m"
  }
}

I'll be interested to know if that solves your problem.

Hi, I updated the configuration but the problem persists. Here is the repository configuration:

{
  "gcs_repository": {
    "type": "gcs",
    "settings": {
      "bucket": "backup.<replaced>.com",
      "service_account": "_default_",
      "http": {
        "connect_timeout": "1m",
        "read_timeout": "1m"
      },
      "base_path": "elasticsearch/snapshots-v5",
      "compress": "true",
      "max_snapshot_bytes_per_sec": "100mb"
    }
  }
}

Can we try bumping the timeouts to something way higher, like 30m, to see if it helps? Also, can you enable debug logging on the GCS APIs? i.e.

PUT /_cluster/settings
{
    "transient" : {
        "logger.com.google.api" : "DEBUG"
    }
}
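And for the timeout bump, re-registering the repository with higher values would look something like this (30m is only illustrative; the remaining settings are copied from your configuration above):

PUT _snapshot/gcs_repository
{
  "type": "gcs",
  "settings": {
    "bucket": "backup.<replaced>.com",
    "service_account": "_default_",
    "base_path": "elasticsearch/snapshots-v5",
    "compress": "true",
    "max_snapshot_bytes_per_sec": "100mb",
    "http.connect_timeout": "30m",
    "http.read_timeout": "30m"
  }
}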

Hello,

Thank you for the suggestions. This did not result in additional logs. The snapshots have been failing for 5 days in a row now, so this is becoming critical for us and we'll have to go back to using fs-type snapshots pushed to GCS in a separate step.

I would be interested in feedback from other users backing up their snapshots to GCS. Without having studied the plugin's code: could it be that no retry attempts are made?

I include some recent logs; the error is of the Service Unavailable variety mentioned in the original post.

May  5 03:39:10 es-host-002 elasticsearch[es-host-002][snapshot][T#12] WARN SnapshotShardsService - [[index_crm_0][3]] [gcs_repository:production-index-20170505/wDEYc_-yQu-AmA78pzdieg] failed to create snapshot [index_crm_0/mij0Os3cQHK936GRI32VyQ][[index_crm_0][3]] IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: GoogleJsonResponseException[503 Service Unavailable
Service Unavailable];
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1366)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:961)
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382)
    at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88)
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 503 Service Unavailable
Service Unavailable
    at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
    at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:432)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
    at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
    at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobStore.lambda$writeBlob$5(GoogleCloudStorageBlobStore.java:219)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobStore.doPrivileged(GoogleCloudStorageBlobStore.java:333)
    at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobStore.writeBlob(GoogleCloudStorageBlobStore.java:213)
    at org.elasticsearch.common.blobstore.gcs.GoogleCloudStorageBlobContainer.writeBlob(GoogleCloudStorageBlobContainer.java:72)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1422)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1364)
    ... 9 more

Given this, I wonder if you are hitting a rate limit on GCS? In 5.2 we introduced an incompatible-snapshots blob, which has led to a performance regression: the cloud storage repository is queried many more times than necessary, see GET /_snapshot/[repo]/_all slow/inefficient · Issue #24509 · elastic/elasticsearch · GitHub. We are working on a fix. In light of this, I wonder if GCS is rate limiting your ES instances, causing some of the shard uploads to fail. Is it possible to try this on Elasticsearch v5.1.2 (Elasticsearch 5.1.2 | Elastic) and see if you still have the problem?
