Check if restore operation is still running

Hello,

Is there any way to monitor a restore if the wait_for_completion flag is set to false?
For now I use the GET _recovery API. I check a few conditions for each shard of the restored indices to decide whether the restore operation is still running (a sketch of this check follows the lists below):

  • .source.snapshot == $restore_snapshot (or all the previous snapshots in the repository?)
  • .stop_time_in_millis != null

Success conditions:

  • .stage == "DONE"
  • .index.size.percent == "100.0%"
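For reference, here is a rough sketch of the polling loop I use (Python with the requests library; the http://localhost:9200 endpoint, the snapshot name and the polling interval are placeholders for my environment):

import time
import requests

ES = "http://localhost:9200"          # placeholder endpoint
RESTORE_SNAPSHOT = "es-snapshot"      # placeholder: name of the restored snapshot

def restore_in_progress():
    # GET _recovery lists the recovery state of every shard, per index.
    recovery = requests.get(f"{ES}/_recovery").json()
    for index_name, data in recovery.items():
        for shard in data["shards"]:
            source = shard.get("source", {})
            # Only consider shards that are being recovered from our snapshot.
            if source.get("snapshot") != RESTORE_SNAPSHOT:
                continue
            # A shard is still restoring until it reaches stage DONE
            # (equivalently, until index.size.percent reaches "100.0%").
            if shard["stage"] != "DONE":
                return True
    return False

while restore_in_progress():
    time.sleep(30)
print("restore finished; check cluster health to see whether it succeeded")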

Is there a more reliable way to do it? The documentation mentions the green/red cluster/index state, but if something goes wrong and I rely only on that condition, I will end up in an infinite loop. I cannot set any timeout in my case because I store a very large amount of data in the database.

Thanks,
Jacek

Use the cluster health API: the cluster health will be yellow during the restore and will eventually go green (on success) or red (on failure).
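For example, a minimal polling sketch of that check (Python with the requests library; an unauthenticated cluster at http://localhost:9200 and a 30-second interval are assumptions):

import time
import requests

ES = "http://localhost:9200"  # placeholder endpoint

while True:
    # GET _cluster/health reports the overall cluster status.
    status = requests.get(f"{ES}/_cluster/health").json()["status"]
    if status == "green":
        print("restore completed successfully")
        break
    if status == "red":
        print("restore failed; investigate the unassigned shards")
        break
    time.sleep(30)  # still yellow: restore in progress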

According to the documentation
"When restore operation is executed the cluster typically goes into red state."

Checked. During the restore most of the indices are in yellow state, but there are a few of them in red state.

Ah, thanks for pointing that out. That hasn't been true for a long time; I opened https://github.com/elastic/elasticsearch/pull/59015/files to fix those docs.

What about this? I executed a restore in my environment. All the yellow and red indices switched to green after some time (probably at the end of the restore). The same thing happened with the cluster health: it was green before the restore, switched to red during the restore, and went back to green at the end.

That means that something went wrong during the restore, but it was saved by a retry.

Is it possible to check why it failed? Is the retry executed only on the indices that switched from yellow to red, or on all indices from the Elasticsearch backup?
Since a restore can be saved by a retry and the health status can switch from red back to green during the restore, relying only on the health status is not sufficient. Am I wrong?

Yes, here's an article about investigating red cluster health:

The retry is per-shard; in recent versions there are more tenacious retries at lower levels too.

I'm not sure what you mean by "not sufficient": you should wait until the cluster health is either red or green, and if it goes red then you should investigate ASAP.


Investigating ASAP is not really needed if the restore ends in success (green cluster state) after an automatic retry, is it? If it isn't, then is it possible to check that there are no retries left, i.e. that the restore is finished and the cluster is stuck in a red state? In my opinion that would be the final condition which requires manual investigation.

I think you should always investigate a red cluster during a restore. It might eventually sort itself out, but there's still something wrong if a restore fails part-way through and needs retrying.

The Allocation Explain API described in the article I shared above will tell you whether a shard has reached the maximum number of retries.
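For example, a rough sketch of that check (Python with the requests library; the endpoint and index name are placeholders, and 5 is the default value of the index.allocation.max_retries setting):

import requests

ES = "http://localhost:9200"   # placeholder endpoint
MAX_RETRIES = 5                # default of index.allocation.max_retries

# Ask the allocation explain API about the problematic primary shard.
explain = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={"index": "index_with_red_status", "shard": 0, "primary": True},
).json()

attempts = explain.get("unassigned_info", {}).get("failed_allocation_attempts", 0)
if attempts >= MAX_RETRIES:
    print("no automatic retries left; manual investigation is needed")
else:
    print(f"{attempts} failed allocation attempt(s) so far; Elasticsearch may still retry")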

OK, I have the following problem with allocating shards:

GET /_cluster/allocation/explain
{ 
   "index": "index_with_red_status", 
   "shard": 0, 
   "primary": true
}

Response:

{
  "index" : "index_with_red_status",
  "shard" : 0,
  "primary" : true,
  "current_state" : "initializing",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-07-09T10:03:43.306Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [ymRg964ASrmkefI3uWbNrQ]: failed recovery, failure RecoveryFailedException[[index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: IOException[Input/output error]; ",
    "last_allocation_status" : "awaiting_info"
  },
  "current_node" : {
    "id" : "Uq-XbuBaR7Glzakj47ikOQ",
    "name" : "elasticsearch-data-7",
    "transport_address" : "192.168.1.206:9300"
  },
  "explanation" : "the shard is in the process of initializing on node [elasticsearch-data-7], wait until initialization has completed"
}

It happens to random indices during the snapshot restore. They are fixed after a retry (or retries), but could you please tell me what the possible cause of the failure is? There are no problems with connections between the nodes, and all the ES Docker containers are up and running without any restarts.

I think there will be more detail in the logs (either the master node or node ymRg964ASrmkefI3uWbNrQ) at around 2020-07-09T10:03:43.306Z. Can you provide that too?

The phrase Input/output error means EIO, whose docs say "Usually used for physical read or write errors", so I think this indicates you have hardware issues. Check the kernel logs with dmesg on node ymRg964ASrmkefI3uWbNrQ for more information.

Logs from around 2020-07-09T10:03:43.306Z:

Data node:

[[index_with_red_status][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2369) [elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.0.1.jar:7.0.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) 
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:495) 
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
Caused by: java.io.IOException: Input/output error
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
	at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:48) ~[?:?]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
	at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:223) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:69) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFile(FileRestoreContext.java:261) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFiles(FileRestoreContext.java:238) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restore(FileRestoreContext.java:199) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:868) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:479) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1695) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2365) ~[elasticsearch-7.0.1.jar:7.0.1]

Master node:

failing shard [failed shard, shard [index_with_red_status][0], node[ymRg964ASrmkefI3uWbNrQ], [P], recovery_source[snapshot recovery [Hpx9Aq8uRH2pnNjFDTWyzw] from repo:es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q], s[INITIALIZING], a[id=M3RdzhCLQAGYnNFzRSeXKw], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2020-07-09T09:24:23.418Z], delayed=false, details[restore_source[repo/es-snapshot]], allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: IOException[Input/output error]; ], markAsStale [true]]}
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2369) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) 
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:495)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
Caused by: java.io.IOException: Input/output error
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
	at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:48) ~[?:?]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
	at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:223) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]
	at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
	at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:69) ~[elasticsearch-7.0.1.jar:7.0.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFile(FileRestoreContext.java:261) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFiles(FileRestoreContext.java:238) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.FileRestoreContext.restore(FileRestoreContext.java:199) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:868) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:479) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1695) ~[elasticsearch-7.0.1.jar:7.0.1]
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2365) ~[elasticsearch-7.0.1.jar:7.0.1]

Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[index_with_red_status][0]] ...]).

Stack traces have been truncated to fit within the character limit.

OK, that's helpful: this is indeed a low-level EIO error returned from a read(2) call on your shared filesystem repository.

I'm guessing this is NFS? Not sure I can help much more there, but that's definitely not supposed to happen.

