Snapshot restore process does not finish

Hi,

We recently upgraded our Elasticsearch cluster to version 1.4.1. I took a
snapshot of one of the indexes (index 2014_11, size 1.44 TiB, with 1
replica) and it completed successfully.
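
For reference, the repository and snapshot were created roughly like this
(repository and snapshot names match the log below; the bucket and region
are placeholders for our real settings):

    # Register the S3 repository (bucket/region are illustrative placeholders)
    curl -XPUT 'localhost:9200/_snapshot/my_s3_repository' -d '{
      "type": "s3",
      "settings": {
        "bucket": "my-backup-bucket",
        "region": "us-east-1"
      }
    }'

    # Snapshot the single index and wait for completion
    curl -XPUT 'localhost:9200/_snapshot/my_s3_repository/2014-11?wait_for_completion=true' -d '{
      "indices": "2014_11"
    }'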

When I try to restore the snapshot from S3 to a new Elasticsearch cluster I
created (also ES version 1.4.1, 3 data nodes in total, AWS Cloud
Plugin 2.4.1), I get the following log:
[2014-12-08 19:21:41,370][WARN ][cluster.action.shard ] [elasticsearch-sndbx01] [2014_11][1] sending failed shard for [2014_11][1], node[rx0dvWQMTpKd68MdkAyw0w], [P], restoring[my_s3_repository:2014-11], s[INITIALIZING], indexUUID [Ms7CtS7kRg6pth9YXgPSEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[2014_11][1] failed recovery]; nested: IndexShardRestoreFailedException[[2014_11][1] restore failed]; nested: IndexShardRestoreFailedException[[2014_11][1] failed to restore snapshot [2014-11]]; nested: IndexShardRestoreFailedException[[2014_11][1] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]
[2014-12-08 19:23:04,128][WARN ][indices.cluster ] [elasticsearch-sndbx01] [2014_11][5] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [2014_11][5] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [2014_11][5] restore failed
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [2014_11][5] failed to restore snapshot [2014-11]
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:165)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [2014_11][5] Failed to recover index
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:787)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    ... 5 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:150)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:577)
    at sun.security.ssl.InputRecord.read(InputRecord.java:532)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:911)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92)
    at java.io.InputStream.read(InputStream.java:101)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:834)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    ... 6 more
[2014-12-08 19:23:04,129][WARN ][cluster.action.shard ] [elasticsearch-sndbx01] [2014_11][5] sending failed shard for [2014_11][5], node[rx0dvWQMTpKd68MdkAyw0w], [P], restoring[my_s3_repository:2014-11], s[INITIALIZING], indexUUID [Ms7CtS7kRg6pth9YXgPSEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[2014_11][5] failed recovery]; nested: IndexShardRestoreFailedException[[2014_11][5] restore failed]; nested: IndexShardRestoreFailedException[[2014_11][5] failed to restore snapshot [2014-11]]; nested: IndexShardRestoreFailedException[[2014_11][5] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]
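
(The restore itself was started with the standard restore API; a sketch,
using the repository and snapshot names from the log above:)

    # Kick off the restore of the snapshot into the new cluster
    curl -XPOST 'localhost:9200/_snapshot/my_s3_repository/2014-11/_restore' -d '{
      "indices": "2014_11"
    }'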

In addition, I have noticed that the restore process is very slow: roughly
100 GB in 10 hours.
My current problem is that I have a 740 GB index which I want to reindex,
but I can't perform this operation because its snapshot has not finished
after 2 days.
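
This is how I am watching the progress, for reference (a sketch, assuming
the repository and snapshot names from the log above):

    # Per-shard status of the snapshot being restored
    curl 'localhost:9200/_snapshot/my_s3_repository/2014-11/_status?pretty'

    # Per-shard recovery progress on the restoring index
    curl 'localhost:9200/2014_11/_recovery?pretty'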

Is that normal?
Please advise.

Thanks,
Costya, Totango.


Hi,

Was there any resolution on this? We have a very similar issue: slow recovery, with the snapshot in an S3 bucket (in fact GCS with S3 compatibility).

We see this in the logs:

[2015-07-07 09:16:23,870][WARN ][indices.cluster          ] [am2-elasticsearch-backup01-backup-si] [mdb-pod2000-1][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [mdb-pod2000-1][4] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [mdb-pod2000-1][4] restore failed
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [mdb-pod2000-1][4] failed to restore snapshot [20150630141619]
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:158)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [mdb-pod2000-1][4] Failed to recover index
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:836)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:155)
    ... 5 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
    at sun.security.ssl.InputRecord.read(InputRecord.java:480)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at java.security.DigestInputStream.read(DigestInputStream.java:161)
    at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.elasticsearch.cloud.aws.blobstore.AbstractS3BlobContainer$1.run(AbstractS3BlobContainer.java:99)
    ... 3 more

It looks like we recovered only 20 GB in 12 hours. Is this a normal rate? Is there a way to debug the bottleneck?
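
One knob I plan to check is the repository's per-node restore throttle
(max_restore_bytes_per_sec, which I believe defaults to 40mb in 1.x); a
sketch, with placeholder repository and bucket names:

    # Inspect the repository's current settings
    curl 'localhost:9200/_snapshot/my_s3_repository?pretty'

    # Re-register the repository with a higher per-node restore throttle
    # (the other settings must match the original registration)
    curl -XPUT 'localhost:9200/_snapshot/my_s3_repository' -d '{
      "type": "s3",
      "settings": {
        "bucket": "my-backup-bucket",
        "max_restore_bytes_per_sec": "100mb"
      }
    }'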

Thanks,
Igor

+1 ... I would also like to watch this thread to learn about possible issues with snapshot/restore.

OK, so we solved the issue on our end, and now the restore completes even for datasets of 5 TB+.

We use Google Cloud Storage buckets in S3 compatibility mode to store our cluster backups, and we had created Nearline buckets. The issue was that Nearline buckets are throttled heavily. Because of this, ES was not able to recover even a single shard: it hit the read timeout before a shard was fully recovered, restarted the shard recovery, timed out again, and so on. We played with many different settings, but the result was always the same: not even a single shard was recovered.

The solution was simply to create Durable Reduced Availability (DRA) buckets instead. They are not throttled, and now we can recover even multi-TB datasets in a few hours.
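
For anyone who wants to replicate this, the switch is just a new repository
pointing at the DRA bucket plus a fresh restore; a sketch with placeholder
repository/bucket names (the GCS S3-interoperability endpoint and
credentials are configured on the cloud-aws plugin side and omitted here):

    # Register a repository backed by the new DRA bucket
    curl -XPUT 'localhost:9200/_snapshot/dra_backup' -d '{
      "type": "s3",
      "settings": {
        "bucket": "my-dra-backup-bucket"
      }
    }'

    # Restore the snapshot from it (snapshot name from our logs above)
    curl -XPOST 'localhost:9200/_snapshot/dra_backup/20150630141619/_restore'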

If you see this SocketTimeoutException, I advise you to check whether you are being throttled when reading from your repository. If that is the case, ES probably can't recover even a single shard, as it hits the timeout and restarts the shard recovery over and over.
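
A quick way to check, independent of ES, is to time a raw download of one
of the larger snapshot blobs straight from the bucket (bucket and object
paths below are placeholders):

    # Time a direct read from the bucket to confirm or rule out throttling
    time gsutil cp gs://my-backup-bucket/indices/mdb-pod2000-1/4/__0 /dev/null

    # Or, for a plain S3 repository:
    time aws s3 cp s3://my-backup-bucket/indices/2014_11/1/__0 - > /dev/null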
