Is there any way to monitor a restore if the wait_for_completion flag is set to false?
For now I use the GET _recovery API. I have a few conditions for each shard in the indices to check whether the restore operation is still running (a rough sketch of the whole check follows the lists below):
.source.snapshot == $restore_snapshot (or all the previous snapshots in the repository?)
.stop_time_in_millis != null
Success conditions:
.stage == "DONE"
.index.size.percent == "100.0%"
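Roughly, the whole check looks like the sketch below (a minimal Python sketch; the cluster URL, the absence of authentication, and the snapshot name es-snapshot are assumptions/placeholders):

import time

import requests  # plain HTTP client; the official Elasticsearch client would work too

ES = "http://localhost:9200"   # assumption: cluster URL, no security
SNAPSHOT = "es-snapshot"       # assumption: name of the snapshot being restored

def restore_done():
    # GET _recovery reports ongoing and completed recoveries per index and shard.
    recovery = requests.get(f"{ES}/_recovery").json()
    shards = [
        shard
        for index in recovery.values()
        for shard in index["shards"]
        if shard["type"] == "SNAPSHOT"
        and shard["source"].get("snapshot") == SNAPSHOT
    ]
    if not shards:
        return False  # the restore has not shown up in _recovery yet
    # Success condition from the list above: every restored shard reports stage DONE.
    return all(shard["stage"] == "DONE" for shard in shards)

while not restore_done():
    time.sleep(30)
print("all restored shards report stage DONE")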
Is there a better way to do it? The documentation mentions the green/red cluster/index health states, but if something goes wrong and I rely only on that condition, I will end up in an infinite loop. I have no way to set a timeout in my case, because I store a very large amount of data in the database.
What about this? I executed a restore on my environment. All the yellow and red indices switched to green after some time (probably at the end of the restore). The same thing happened with the cluster health state: it switched from green before the restore to red during the restore, and back to green at the end.
Is it possible to check why it failed? Is the retry executed only on the indices that switched from yellow to red, or on all indices from the Elasticsearch backup?
Since a restore can be saved by a retry, and the health status can switch from red back to green during the restore, relying only on the health status is not sufficient. Am I wrong?
Yes, here's an article about investigating red cluster health:
The retry is per-shard; in recent versions there are more tenacious retries at lower levels too.
I'm not sure what you mean by "not sufficient": you should wait until the cluster health is either red or green, and if it goes red then you should investigate ASAP.
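For instance, one way to poll for that (a minimal Python sketch; the URL and the 60s timeout are assumptions, adjust to your setup):

import requests

ES = "http://localhost:9200"  # assumption: cluster URL, no security

# Wait up to 60s for green; the call returns earlier if green is reached,
# otherwise it returns with the current status once the timeout expires.
health = requests.get(
    f"{ES}/_cluster/health",
    params={"wait_for_status": "green", "timeout": "60s"},
).json()

if health["status"] == "green":
    print("cluster is green - restore completed")
elif health["status"] == "red":
    print("cluster is red - investigate (allocation explain API, node logs)")
else:
    print("still yellow - keep waiting")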
Investigating ASAP is not really needed if the restore ends successfully (green cluster state) after an automatic retry, is it? If not, is it possible to check that there are no retries left - that the restore has finished and the cluster is stuck in a red state? In my opinion that would be the final condition requiring manual investigation.
I think you should always investigate a red cluster during a restore. It might eventually sort itself out, but there's still something wrong if a restore fails part-way through and needs retrying.
The Allocation Explain API described in the article I shared above will tell you if a shard has got to the maximum number of retries.
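For example, a sketch along these lines (Python again, with the same assumed URL and a placeholder index name) reads failed_allocation_attempts from the explain output; once index.allocation.max_retries (default 5) is exhausted, the shard stays unassigned until you ask for another attempt with POST /_cluster/reroute?retry_failed=true:

import requests

ES = "http://localhost:9200"     # assumption: cluster URL, no security
INDEX = "some_red_index"         # placeholder: the index whose primary is failing

# Ask the cluster why this primary shard is unassigned or failing.
explain = requests.get(
    f"{ES}/_cluster/allocation/explain",
    json={"index": INDEX, "shard": 0, "primary": True},
).json()

attempts = explain.get("unassigned_info", {}).get("failed_allocation_attempts", 0)
print("failed allocation attempts:", attempts)
print("explanation:", explain.get("explanation"))

# After index.allocation.max_retries failures the shard stays unassigned
# until a manual retry is requested:
#   requests.post(f"{ES}/_cluster/reroute", params={"retry_failed": "true"})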
OK, I have the following problems with allocating shards:
GET /_cluster/allocation/explain
{
  "index": "index_with_red_status",
  "shard": 0,
  "primary": true
}
Response:
{
  "index" : "index_with_red_status",
  "shard" : 0,
  "primary" : true,
  "current_state" : "initializing",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-07-09T10:03:43.306Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [ymRg964ASrmkefI3uWbNrQ]: failed recovery, failure RecoveryFailedException[[index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: IOException[Input/output error]; ",
    "last_allocation_status" : "awaiting_info"
  },
  "current_node" : {
    "id" : "Uq-XbuBaR7Glzakj47ikOQ",
    "name" : "elasticsearch-data-7",
    "transport_address" : "192.168.1.206:9300"
  },
  "explanation" : "the shard is in the process of initializing on node [elasticsearch-data-7], wait until initialization has completed"
}
It happens to random indices during the snapshot restore. They are fixed after a retry (or retries), but could you please tell me what the possible cause of the failure is? There are no problems with connections between the nodes; all the ES Docker containers are up and running, without any restarts.
I think there will be more detail in the logs (either the master node or node ymRg964ASrmkefI3uWbNrQ) at around 2020-07-09T10:03:43.306Z. Can you provide that too?
The phrase Input/output error means EIO, whose docs say "Usually used for physical read or write errors", so I think this indicates you have hardware issues. Check the kernel logs with dmesg on node ymRg964ASrmkefI3uWbNrQ for more information.
Data node elasticsearch-data-3 (ymRg964ASrmkefI3uWbNrQ):
index_with_red_status][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2369) [elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.0.1.jar:7.0.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:495)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
Caused by: java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:48) ~[?:?]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:223) ~[?:?]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:69) ~[elasticsearch-7.0.1.jar:7.0.1]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]
at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFile(FileRestoreContext.java:261) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFiles(FileRestoreContext.java:238) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.repositories.blobstore.FileRestoreContext.restore(FileRestoreContext.java:199) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:868) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:479) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1695) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2365) ~[elasticsearch-7.0.1.jar:7.0.1]
Master node:
failing shard [failed shard, shard [index_with_red_status][0], node[ymRg964ASrmkefI3uWbNrQ], [P], recovery_source[snapshot recovery [Hpx9Aq8uRH2pnNjFDTWyzw] from repo:es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q], s[INITIALIZING], a[id=M3RdzhCLQAGYnNFzRSeXKw], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2020-07-09T09:24:23.418Z], delayed=false, details[restore_source[repo/es-snapshot]], allocation_status[fetching_shard_data]], message [failed recovery], failure [RecoveryFailedException[[index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]]; nested: IndexShardRestoreFailedException[Failed to recover index]; nested: IOException[Input/output error]; ], markAsStale [true]]}
org.elasticsearch.indices.recovery.RecoveryFailedException: [index_with_red_status][0]: Recovery failed on {elasticsearch-data-3}{ymRg964ASrmkefI3uWbNrQ}{k53mojOORkqH4G7KUidQaw}{192.168.1.58}{192.168.1.58:9300}
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2369) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.0.1.jar:7.0.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:495)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [es-snapshot/ZiOqBRXCTsKZ9nMywzWC4Q]
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: Failed to recover index
Caused by: java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:48) ~[?:?]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:223) ~[?:?]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
at org.elasticsearch.index.snapshots.blobstore.SlicedInputStream.read(SlicedInputStream.java:92) ~[elasticsearch-7.0.1.jar:7.0.1]
at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[?:?]
at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:69) ~[elasticsearch-7.0.1.jar:7.0.1]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:?]
at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFile(FileRestoreContext.java:261) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.repositories.blobstore.FileRestoreContext.restoreFiles(FileRestoreContext.java:238) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.repositories.blobstore.FileRestoreContext.restore(FileRestoreContext.java:199) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.restoreShard(BlobStoreRepository.java:868) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:479) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1695) ~[elasticsearch-7.0.1.jar:7.0.1]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$10(IndexShard.java:2365) ~[elasticsearch-7.0.1.jar:7.0.1]
Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[index_with_red_status][0]] ...]).
The stack traces are truncated to fit within the character limit.