Check if restore operation is still running

Hello,

Is there any way to monitor a restore if the wait_for_completion flag is set to false?
For now I use the GET _recovery API. I check a few conditions for each shard of the restored indices to decide whether the restore operation is still running (a sketch of this check follows the lists below):

  • .source.snapshot == $restore_snapshot (or all the previous snapshots in the repository?)
  • .stop_time_in_millis != null

Success conditions:

  • .stage == "DONE"
  • .index.size.percent == "100.0%"
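For reference, here is a rough sketch of the polling loop I use (Python with the requests library; the http://localhost:9200 endpoint, the snapshot name and the polling interval are placeholders for my environment):

import time
import requests

ES = "http://localhost:9200"          # placeholder endpoint
RESTORE_SNAPSHOT = "es-snapshot"      # placeholder: name of the restored snapshot

def restore_in_progress():
    # GET _recovery lists the recovery state of every shard, per index.
    recovery = requests.get(f"{ES}/_recovery").json()
    for index_name, data in recovery.items():
        for shard in data["shards"]:
            source = shard.get("source", {})
            # Only consider shards that are being recovered from our snapshot.
            if source.get("snapshot") != RESTORE_SNAPSHOT:
                continue
            # A shard is still restoring until it reaches stage DONE
            # (equivalently, until index.size.percent reaches "100.0%").
            if shard["stage"] != "DONE":
                return True
    return False

while restore_in_progress():
    time.sleep(30)
print("restore finished; check cluster health to see whether it succeeded")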

Is there a more reliable way to do it? The documentation mentions the green/red cluster/index state, but if something goes wrong and I rely only on that condition, I will end up in an infinite loop. I cannot set any timeout in my case because I store a very large amount of data in the database.

Thanks,
Jacek

Use the cluster health API: the cluster health will be yellow during the restore and will eventually go green (on success) or red (on failure).
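For example, a minimal polling sketch of that check (Python with the requests library; an unauthenticated cluster at http://localhost:9200 and a 30-second interval are assumptions):

import time
import requests

ES = "http://localhost:9200"  # placeholder endpoint

while True:
    # GET _cluster/health reports the overall cluster status.
    status = requests.get(f"{ES}/_cluster/health").json()["status"]
    if status == "green":
        print("restore completed successfully")
        break
    if status == "red":
        print("restore failed; investigate the unassigned shards")
        break
    time.sleep(30)  # still yellow: restore in progress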

According to the documentation
"When restore operation is executed the cluster typically goes into red state."

Checked. During the restore most of the indices are in yellow state, but there are a few of them in red state.

Ah, thanks for pointing that out. That hasn't been true for a long time; I opened https://github.com/elastic/elasticsearch/pull/59015/files to fix those docs.

What about this? I executed a restore in my environment. All the yellow and red indices switched to green after some time (probably at the end of the restore). The same thing happened with the cluster health: it was green before the restore, switched to red during the restore, and went back to green at the end.

That means that something went wrong during the restore, but it was saved by a retry.

Is it possible to check why it failed? Is the retry executed only on the indices that switched from yellow to red, or on all indices from the Elasticsearch backup?
Since a restore can be saved by a retry and the health status can switch from red back to green during the restore, relying only on the health status is not sufficient. Am I wrong?

Yes, here's an article about investigating red cluster health:

The retry is per-shard; in recent versions there are more tenacious retries at lower levels too.

I'm not sure what you mean by "not sufficient": you should wait until the cluster health is either red or green, and if it goes red then you should investigate ASAP.


Investigating ASAP is not really needed if the restore ends in success (green cluster state) after an automatic retry, is it? If it isn't, then is it possible to check that there are no retries left, i.e. that the restore is finished and the cluster is stuck in a red state? In my opinion that would be the final condition which requires manual investigation.

I think you should always investigate a red cluster during a restore. It might eventually sort itself out, but there's still something wrong if a restore fails part-way through and needs retrying.

The Allocation Explain API described in the article I shared above will tell you whether a shard has reached the maximum number of retries.
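For example, a rough sketch of that check (Python with the requests library; the endpoint and index name are placeholders, and 5 is the default value of the index.allocation.max_retries setting):

import requests

ES = "http://localhost:9200"   # placeholder endpoint
MAX_RETRIES = 5                # default of index.allocation.max_retries

# Ask the allocation explain API about the problematic primary shard.
explain = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={"index": "index_with_red_status", "shard": 0, "primary": True},
).json()

attempts = explain.get("unassigned_info", {}).get("failed_allocation_attempts", 0)
if attempts >= MAX_RETRIES:
    print("no automatic retries left; manual investigation is needed")
else:
    print(f"{attempts} failed allocation attempt(s) so far; Elasticsearch may still retry")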

OK, I have the following problem with allocating shards:

GET /_cluster/allocation/explain
{ 
   "index": "index_with_red_status", 
   "shard": 0, 
   "primary": true
}

Response:

{
  "index" : "index_with_red_status",
  "shard" : 0,
  "primary" : true,
  "current_state" : "initializing",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-07-09T10:03:43.306Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [ymRg964ASrmkefI3uWbNrQ]: failed recovery, failure RecoveryFailedException[[index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: IOException[Input/output error]; ",
    "last_allocation_status" : "awaiting_info"
  },
  "current_node" : {
    "id" : "Uq-XbuBaR7Glzakj47ikOQ",
    "name" : "elasticsearch-data-7",
    "transport_address" : "192.168.1.206:9300"
  },
  "explanation" : "the shard is in the process of initializing on node [elasticsearch-data-7], wait until initialization has completed"
}

It happens to random indices during the snapshot restore. They are fixed after a retry (or retries), but could you please tell me what the possible cause of the failure is? There are no problems with connections between the nodes, and all the ES Docker containers are up and running without any restarts.

I think there will be more detail in the logs (either the master node or node ymRg964ASrmkefI3uWbNrQ) at around 2020-07-09T10:03:43.306Z. Can you provide that too?

The phrase Input/output error means EIO, whose docs say "Usually used for physical read or write errors", so I think this indicates you have hardware issues. Check the kernel logs with dmesg on node ymRg964ASrmkefI3uWbNrQ for more information.

Logs from around 2020-07-09T10:03:43.306Z:

Data node:

[[index_with_red_status][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2369) [elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.0.1.jar:7.0.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) 
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:495) 
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
Caused by: java.io.IOException: Input/output error
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
	at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:48) ~[?:?]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
	at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:223) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:69) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFile(FileRestoreContext.java:261) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFiles(FileRestoreContext.java:238) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restore(FileRestoreContext.java:199) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:868) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:479) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1695) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2365) ~[elasticsearch-7.0.1.jar:7.0.1]

Master node:

failing shard [failed shard, shard [index_with_red_status][0], node[ymRg964ASrmkefI3uWbNrQ], [P], recovery_source[snapshot recovery [Hpx9Aq8uRH2pnNjFDTWyzw] from repo:es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q], s[INITIALIZING], a[id=M3RdzhCLQAGYnNFzRSeXKw], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2020-07-09T09:24:23.418Z], delayed=false, details[restore_source[repo/es-snapshot]], allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: IOException[Input/output error]; ], markAsStale [true]]}
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2369) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) 
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:495)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
Caused by: java.io.IOException: Input/output error
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
	at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:48) ~[?:?]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
	at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:223) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:69) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFile(FileRestoreContext.java:261) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFiles(FileRestoreContext.java:238) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restore(FileRestoreContext.java:199) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:868) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:479) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1695) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2365) ~[elasticsearch-7.0.1.jar:7.0.1]

Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[index_with_red_status][0]] ...]).

Stack traces have been truncated to fit within the character limit.

OK, that's helpful: this is indeed a low-level EIO error returned from a read(2) call on your shared filesystem repository.

I'm guessing this is NFS? Not sure I can help much more there, but that's definitely not supposed to happen.

