Snapshot restore process does not finish

Hi,

We recently upgraded our Elasticsearch cluster to version 1.4.1. I took a
snapshot of one of the indexes (index 2014_11, size 1.44 TiB, with 1
replica) and it completed successfully.
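
For reference, the repository and snapshot were created roughly like this
(repository and snapshot names match the log below; the bucket and region
are placeholders for our real settings):

    # Register the S3 repository (bucket/region are illustrative placeholders)
    curl -XPUT 'localhost:9200/_snapshot/my_s3_repository' -d '{
      "type": "s3",
      "settings": {
        "bucket": "my-backup-bucket",
        "region": "us-east-1"
      }
    }'

    # Snapshot the single index and wait for completion
    curl -XPUT 'localhost:9200/_snapshot/my_s3_repository/2014-11?wait_for_completion=true' -d '{
      "indices": "2014_11"
    }'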

When I try to restore the snapshot from S3 to a new Elasticsearch cluster I
created (also ES version 1.4.1, 3 data nodes in total, AWS Cloud
Plugin 2.4.1), I get the following log:
[2014-12-08 19:21:41,370][WARN ][cluster.action.shard ] [elasticsearch-sndbx01] [2014_11][1] sending failed shard for [2014_11][1], node[rx0dvWQMTpKd68MdkAyw0w], [P], restoring[my_s3_repository:2014-11], s[INITIALIZING], indexUUID [Ms7CtS7kRg6pth9YXgPSEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[2014_11][1] failed recovery]; nested: IndexShardRestoreFailedException[[2014_11][1] restore failed]; nested: IndexShardRestoreFailedException[[2014_11][1] failed to restore snapshot [2014-11]]; nested: IndexShardRestoreFailedException[[2014_11][1] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]
[2014-12-08 19:23:04,128][WARN ][indices.cluster ] [elasticsearch-sndbx01] [2014_11][5] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [2014_11][5] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [2014_11][5] restore failed
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [2014_11][5] failed to restore snapshot [2014-11]
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:165)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [2014_11][5] Failed to recover index
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:787)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    ... 5 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:150)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:577)
    at sun.security.ssl.InputRecord.read(InputRecord.java:532)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:911)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92)
    at java.io.InputStream.read(InputStream.java:101)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:834)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    ... 6 more
[2014-12-08 19:23:04,129][WARN ][cluster.action.shard ] [elasticsearch-sndbx01] [2014_11][5] sending failed shard for [2014_11][5], node[rx0dvWQMTpKd68MdkAyw0w], [P], restoring[my_s3_repository:2014-11], s[INITIALIZING], indexUUID [Ms7CtS7kRg6pth9YXgPSEg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[2014_11][5] failed recovery]; nested: IndexShardRestoreFailedException[[2014_11][5] restore failed]; nested: IndexShardRestoreFailedException[[2014_11][5] failed to restore snapshot [2014-11]]; nested: IndexShardRestoreFailedException[[2014_11][5] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]
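
(The restore itself was started with the standard restore API; a sketch,
using the repository and snapshot names from the log above:)

    # Kick off the restore of the snapshot into the new cluster
    curl -XPOST 'localhost:9200/_snapshot/my_s3_repository/2014-11/_restore' -d '{
      "indices": "2014_11"
    }'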

In addition, I have noticed that the restore process is very slow: roughly
100 GB in 10 hours.
My current problem is that I have a 740 GB index which I want to reindex,
but I can't perform this operation because its snapshot has not finished
after 2 days.
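
This is how I am watching the progress, for reference (a sketch, assuming
the repository and snapshot names from the log above):

    # Per-shard status of the snapshot being restored
    curl 'localhost:9200/_snapshot/my_s3_repository/2014-11/_status?pretty'

    # Per-shard recovery progress on the restoring index
    curl 'localhost:9200/2014_11/_recovery?pretty'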

Is that normal?
Please advise.

Thanks,
Costya, Totango.


Hi,

Was there any resolution on this? We have a very similar issue: slow recovery, with the snapshot in an S3 bucket (in fact GCS with S3 compatibility).

We see this in the logs:

[2015-07-07 09:16:23,870][WARN ][indices.cluster          ] [am2-elasticsearch-backup01-backup-si] [mdb-pod2000-1][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [mdb-pod2000-1][4] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [mdb-pod2000-1][4] restore failed
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [mdb-pod2000-1][4] failed to restore snapshot [20150630141619]
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:158)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [mdb-pod2000-1][4] Failed to recover index
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:836)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:155)
    ... 5 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:152)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
    at sun.security.ssl.InputRecord.read(InputRecord.java:480)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at java.security.DigestInputStream.read(DigestInputStream.java:161)
    at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.elasticsearch.cloud.aws.blobstore.AbstractS3BlobContainer$1.run(AbstractS3BlobContainer.java:99)
    ... 3 more

It looks like we recovered only 20 GB in 12 hours. Is this a normal rate? Is there a way to debug the bottleneck?
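
One knob I plan to check is the repository's per-node restore throttle
(max_restore_bytes_per_sec, which I believe defaults to 40mb in 1.x); a
sketch, with placeholder repository and bucket names:

    # Inspect the repository's current settings
    curl 'localhost:9200/_snapshot/my_s3_repository?pretty'

    # Re-register the repository with a higher per-node restore throttle
    # (the other settings must match the original registration)
    curl -XPUT 'localhost:9200/_snapshot/my_s3_repository' -d '{
      "type": "s3",
      "settings": {
        "bucket": "my-backup-bucket",
        "max_restore_bytes_per_sec": "100mb"
      }
    }'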

Thanks,
Igor

+1 ... I would also like to watch this thread to learn about possible issues with snapshot/restore.

OK, so we solved the issue on our end, and now the restore completes even for datasets of 5 TB+.

We use Google Cloud Storage buckets in S3 compatibility mode to store our cluster backups, and we had created Nearline buckets. The issue was that Nearline buckets are throttled heavily. Because of this, ES was not able to recover even a single shard: it hit the read timeout before a shard was fully recovered, restarted the shard recovery, timed out again, and so on. We played with many different settings, but the result was always the same: not even a single shard was recovered.

The solution was simply to create Durable Reduced Availability (DRA) buckets instead. They are not throttled, and now we can recover even multi-TB datasets in a few hours.
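
For anyone who wants to replicate this, the switch is just a new repository
pointing at the DRA bucket plus a fresh restore; a sketch with placeholder
repository/bucket names (the GCS S3-interoperability endpoint and
credentials are configured on the cloud-aws plugin side and omitted here):

    # Register a repository backed by the new DRA bucket
    curl -XPUT 'localhost:9200/_snapshot/dra_backup' -d '{
      "type": "s3",
      "settings": {
        "bucket": "my-dra-backup-bucket"
      }
    }'

    # Restore the snapshot from it (snapshot name from our logs above)
    curl -XPOST 'localhost:9200/_snapshot/dra_backup/20150630141619/_restore'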

If you see this SocketTimeoutException, I advise you to check whether you are being throttled when reading from your repository. If that is the case, ES probably can't recover even a single shard, as it hits the timeout and restarts the shard recovery over and over.
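
A quick way to check, independent of ES, is to time a raw download of one
of the larger snapshot blobs straight from the bucket (bucket and object
paths below are placeholders):

    # Time a direct read from the bucket to confirm or rule out throttling
    time gsutil cp gs://my-backup-bucket/indices/mdb-pod2000-1/4/__0 /dev/null

    # Or, for a plain S3 repository:
    time aws s3 cp s3://my-backup-bucket/indices/2014_11/1/__0 - > /dev/null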
