Cannot list snapshots in S3 repository (previously working fine)

Hi

I've been using the repository-s3 plugin with Elasticsearch 5.3.3 happily for months. A couple of days ago, our regular snapshot restore process started failing. It shows a timeout attempting to read the list of snapshots in the repository.

I can still see the repository listed when I list repositories using this URL:

http://myhost:9200/_snapshot/_all

but when I try to list the snapshots available in the repository like this:

http://myhost:9200/_snapshot/myrepo/_all

The request just hangs indefinitely. I see no logging at all on the ES host in /var/log/elasticsearch/elasticsearch.log

Nothing related to S3 permissions or the instance running ES has changed (this is all managed by Terraform so I can be quite certain that the configuration has not changed in months).

Can anyone suggest anything I can do to diagnose the problem? Can I turn on debug logging of some kind for the repository-s3 plugin?

I realised that the logging for this error appears on the master. The stack trace I see in the log when this timeout occurs is:

[2017-10-17T11:38:04,620][WARN ][r.suppressed             ] path: /_snapshot/myrepo/_all, params: {repository=myrepo, snapshot=_all}
com.amazonaws.AmazonClientException: Unable to execute HTTP request: connect timed out
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:618) ~[?:?]
    at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) ~[?:?]
    at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) ~[?:?]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) ~[?:?]
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3654) ~[?:?]
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1092) ~[?:?]
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:970) ~[?:?]
    at org.elasticsearch.cloud.aws.blobstore.S3BlobContainer.readBlob(S3BlobContainer.java:84) ~[?:?]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.readSnapshotIndexLatestBlob(BlobStoreRepository.java:912) ~[elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.latestIndexBlobId(BlobStoreRepository.java:890) ~[elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:753) ~[elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.snapshots.SnapshotsService.getRepositoryData(SnapshotsService.java:142) ~[elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:91) [elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:50) [elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:87) [elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:166) [elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.3.3.jar:5.3.3]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.3.3.jar:5.3.3]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_131]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_131]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_131]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_131]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_131]
    at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_131]
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:668) ~[?:?]
    at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:542) ~[?:?]
    at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412) ~[?:?]
    at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:134) ~[?:?]
    at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:179) ~[?:?]
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:328) ~[?:?]
    at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:612) ~[?:?]
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:447) ~[?:?]
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:884) ~[?:?]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[?:?]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[?:?]
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) ~[?:?]
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) ~[?:?]
    ... 20 more

Based on the stack trace, it looks like ES is failing to get the index.latest file. When I attempt to get this file from the same host, using the AWS CLI, the file downloads immediately.

Can I turn on debug logging of some kind for the repository-s3 plugin?

In 5.3, packages are:

  • org.elasticsearch.cloud.aws
  • org.elasticsearch.plugin.repository.s3
  • org.elasticsearch.repositories.s3

In 5.6:

  • org.elasticsearch.repositories.s3

You can change it with:

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.org.elasticsearch.repositories.s3": "trace"
  }
}
'

Wondering if you could try to upgrade first to latest 5.6?

@Igor_Motov Any other idea?

Hmmm, I fixed this problem by restarting all nodes. So for some reason ES was failing to talk to create an SSL connection to s3.cn-north-1.amazonaws.com.cn, but after a restart this succeeded.

Maybe a stale DNS entry that was cached by the JVM? I don't think this was an EC2/networking issue as curl was able to connect to S3 perfectly well from that box.

Thanks for the advice re logging, this will come in handy next time.

Very good to know that could happen. Thanks for the closure on this.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.