No progress in index restore


(Anatoly Petkevich) #1

We have a snapshot of a 1 TB index on S3 and need to restore it on another cluster.
The Elasticsearch version is 1.7.2, the AWS Cloud Plugin version is 2.7.1, and the index has 19 primary shards.
After a full day, no shard has been restored, and tracking the restored index via the _status API shows that the size_in_bytes property does not grow steadily.
The log file contains many warnings like this one:

[2015-12-01 09:16:29,779][WARN ][indices.cluster ] [i-51a3e5ef] [[ii-documents][7]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [ii-documents][7] failed recovery
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:162)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [ii-documents][7] restore failed
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:135)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:109)
... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [ii-documents][7] failed to restore snapshot [snapshot_221120150700]
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:164)
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:126)
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [ii-documents][7] Failed to recover index
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:780)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
... 5 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:946)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:903)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92)
at java.io.InputStream.read(InputStream.java:101)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:813)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:777)
... 6 more

There is an open issue, https://github.com/elastic/elasticsearch-cloud-aws/issues/149, and a related topic, "Snapshot restore process is not finished".
Please advise whether there is a solution or workaround for this issue.
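
One way to put numbers on the uneven size_in_bytes growth described above is to record the value at intervals and compute the rate between consecutive readings. The following is only a minimal sketch: the function names and the sample values are made up for illustration, and it assumes the readings were collected separately (for example by polling the status endpoint).

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (unix time in seconds, size_in_bytes reading)


def restore_rates(samples: List[Sample]) -> List[float]:
    """Return bytes/second between each pair of consecutive samples."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            raise ValueError("samples must be in strictly increasing time order")
        rates.append((b1 - b0) / elapsed)
    return rates


def stalled_intervals(samples: List[Sample], min_rate: float = 1.0) -> List[int]:
    """Return indices of intervals whose rate is below min_rate bytes/second.

    A zero or negative rate (size not growing, or shrinking) counts as stalled.
    """
    return [i for i, rate in enumerate(restore_rates(samples)) if rate < min_rate]


# Hypothetical readings taken one minute apart (not real data from this cluster).
example_samples = [
    (0.0, 0),
    (60.0, 600_000_000),
    (120.0, 600_000_000),
    (180.0, 300_000_000),
]
```

Lining up the stalled intervals with the timestamps of the "failed recovery" warnings may show whether each stall coincides with a read timeout.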


(David Pilato) #2

Is this happening once or every time you try to restore?

It sounds like a timeout occurred while reading from the S3 bucket.


(Anatoly Petkevich) #3

It happens regularly; no more than 10% of the index data has been restored so far.


(David Pilato) #4

I added a comment on the issue: https://github.com/elastic/elasticsearch-cloud-aws/issues/149#issuecomment-160998554

You could also try changing the timeout, which is 50s by default.
I'm unsure whether this will change anything.

I wonder whether the connection between your machines and the S3 bucket is good enough. I assume they are in the same region?

The stack trace shows a typical AWS connection problem. Maybe we should add retries by setting http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/ClientConfiguration.html#setMaxErrorRetry(int), but the documentation says:

Sets the maximum number of retry attempts for failed retryable requests (ex: 5xx error responses from services).

I'm unsure whether a SocketTimeoutException counts as a retryable failure...
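
To illustrate the retry idea in general terms, here is a minimal sketch of retrying an operation when a read times out, with exponential backoff between attempts. It is written in Python and is independent of the AWS SDK: retry_on_timeout and its parameters are hypothetical names, and it says nothing about whether the SDK's own setMaxErrorRetry covers socket timeouts.

```python
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_timeout(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying up to max_retries times on socket.timeout.

    The wait doubles after each failed attempt: base_delay, 2*base_delay, ...
    Any other exception, or a timeout after the last retry, is re-raised.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except socket.timeout:
            if attempt >= max_retries:
                raise  # retries exhausted; let the caller see the timeout
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

For a large file, retrying the whole read repeats work already done, so a real implementation would probably want to resume from the last byte received; this sketch does not attempt that.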

