Snapshot restore failed shards

I am setting up an automated process to test my latest Elasticsearch snapshots. The snapshots are taken with SLM, and the snapshot data is synced from the production cluster to a local server.

On the local server I have a bash script that starts a Docker container with the snapshot data mounted, and a Python script that makes all the API requests to Elasticsearch.

The snapshot is around 2 TB, so restoring all the data takes a while, which is why I want to automate this process.

This is the POST request I use to restore the snapshot:

curl -XPOST "http://localhost:9200/_snapshot/my_backup/snap-2021.01.25-xgavgjpmtfaor0rdugq7jq/_restore?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
    "indices": "*",
    "ignore_unavailable": true,
    "include_global_state": false,
    "include_aliases": false
}'

After wait_for_completion returns, the script sends me some of the stats from the restore response, and there are always failed shards.
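For context, this is roughly how the Python script reads the stats (a simplified sketch, not the exact script; with wait_for_completion=true the shard counts are under snapshot.shards in the _restore response):

```python
def shard_stats(restore_response):
    """Extract (total, successful, failed) shard counts from a _restore response."""
    shards = restore_response["snapshot"]["shards"]
    return shards["total"], shards["successful"], shards["failed"]

# Example response shaped like what the cluster returns with wait_for_completion=true:
response = {
    "snapshot": {
        "snapshot": "snap-2021.01.25-xgavgjpmtfaor0rdugq7jq",
        "shards": {"total": 654, "successful": 633, "failed": 21},
    }
}

total, successful, failed = shard_stats(response)
print(f"Total shards:{total}")            # Total shards:654
print(f"Successful shards:{successful}")  # Successful shards:633
print(f"Failed shards:{failed}")          # Failed shards:21
```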

Total shards: 654
Successful shards: 633
Failed shards: 21
Troubleshooting
  1. I tried creating a cluster of three Docker containers, since that mirrors the production environment, but it kept crashing with exit code 137 (out of RAM).
  2. I then tried a two-node cluster, which made the restore take more than two days and still left failed shards.
  3. I got a list of all restored indices with curl -XGET localhost:9200/_cat/indices, and a list of all indices in the snapshot with curl -XGET "localhost:9200/_snapshot/my_backup/snap-2021.01.25-xgavgjpmtfaor0rdugq7jq?pretty". I wrote just the index names to two text files and ran diff on them to find which indices are missing from the restore.

diff restore-indices.log snapshot-indices.log
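Instead of diffing text files, the same comparison could be done in the Python script with a set difference (a sketch with illustrative data; in practice the two lists would come from the two curl calls above, e.g. _cat/indices?h=index&format=json for the restored side):

```python
def missing_indices(snapshot_indices, restored_indices):
    """Return snapshot indices that are absent from the restored cluster, sorted."""
    return sorted(set(snapshot_indices) - set(restored_indices))

# Illustrative index names; the real lists come from the Elasticsearch APIs.
snapshot = [
    "filebeat-7.6.2-2020.05.09-000002",
    ".slm-history-2-000001",
    "ilm-history-2-000001",
]
restored = ["filebeat-7.6.2-2020.05.09-000002"]

print(missing_indices(snapshot, restored))
# → ['.slm-history-2-000001', 'ilm-history-2-000001']
```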

List of non-restored indices
> ilm-history-2-000001
> ilm-history-2-000002
> ilm-history-2-000003
> .monitoring-beats-7-2021.01.19
> .monitoring-beats-7-2021.01.20
> .monitoring-beats-7-2021.01.21
> .monitoring-beats-7-2021.01.22
> .monitoring-beats-7-2021.01.23
> .monitoring-beats-7-2021.01.24
> .monitoring-beats-7-2021.01.25
> .monitoring-es-7-2021.01.19
> .monitoring-es-7-2021.01.20
> .monitoring-es-7-2021.01.21
> .monitoring-es-7-2021.01.22
> .monitoring-es-7-2021.01.23
> .monitoring-es-7-2021.01.24
> .monitoring-es-7-2021.01.25
> .monitoring-kibana-7-2021.01.19
> .monitoring-kibana-7-2021.01.20
> .monitoring-kibana-7-2021.01.21
> .monitoring-kibana-7-2021.01.22
> .monitoring-kibana-7-2021.01.23
> .monitoring-kibana-7-2021.01.24
> .monitoring-kibana-7-2021.01.25
> .slm-history-2-000001
> .slm-history-2-000002
> .slm-history-2-000003

The indices that didn't get restored are not important to me, but I want my script to report Failed shards: 0 so I don't have to manually check for important failed indices each time the script runs.

I'm trying to find a way to restore these indices, or a better way to check the status of the restore.
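One option I'm considering is excluding these system/history indices up front, since the restore API's "indices" field accepts exclusion patterns (e.g. "*,-.monitoring-*,-.slm-history-*,-ilm-history-*"). A small sketch of how such patterns would classify the missing indices above, using fnmatch as a stand-in for Elasticsearch's wildcard matching:

```python
from fnmatch import fnmatch

# Exclusion patterns mirroring an "indices" field of
# "*,-.monitoring-*,-.slm-history-*,-ilm-history-*" (illustrative choice of patterns).
EXCLUDE = [".monitoring-*", ".slm-history-*", "ilm-history-*"]

def should_restore(index):
    """True unless the index name matches one of the exclusion patterns."""
    return not any(fnmatch(index, pattern) for pattern in EXCLUDE)

print(should_restore(".monitoring-es-7-2021.01.25"))       # False
print(should_restore("ilm-history-2-000003"))              # False
print(should_restore("filebeat-7.6.2-2020.05.09-000002"))  # True
```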

Log snippet

The log file is large, so I'm only posting a snippet. I can post more if needed.

{"type": "server", "timestamp": "2021-02-09T12:03:59,392Z", "level": "INFO", "component": "o.e.x.i.IndexLifecycleRunner", "cluster.name": "docker-cluster", "node.name": "cc118b7d73ff", "message": "policy [slm-history-ilm-policy] for index [.slm-history-2-000003] on an error step due to a transient error, moving back to the failed step [check-rollover-ready] for execution. retry attempt [3]", "cluster.uuid": "vaBZvVfWRhC-9gp74gBftw", "node.id": "Y8GqxVMSTIWMcHb13Epzqg"  }
{"type": "server", "timestamp": "2021-02-09T12:03:59,393Z", "level": "ERROR", "component": "o.e.x.i.IndexLifecycleRunner", "cluster.name": "docker-cluster", "node.name": "cc118b7d73ff", "message": "policy [ilm-history-ilm-policy] for index [ilm-history-2-000003] failed on step [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]. Moving to ERROR step", "cluster.uuid": "vaBZvVfWRhC-9gp74gBftw", "node.id": "Y8GqxVMSTIWMcHb13Epzqg" , 
"stacktrace": ["java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [ilm-history-2] does not point to index [ilm-history-2-000003]",
"at org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:114) [x-pack-core-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:174) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:327) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:265) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:183) [x-pack-core-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:216) [x-pack-core-7.10.2.jar:7.10.2]",
"at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]",
"at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]",
"at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
{"type": "server", "timestamp": "2021-02-09T12:03:59,395Z", "level": "WARN", "component": "o.e.x.i.IndexLifecycleService", "cluster.name": "docker-cluster", "node.name": "cc118b7d73ff", "message": "async action execution failed during policy trigger for index [filebeat-7.6.2-2020.05.09-000002] with policy [filebeat] in step [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"ERROR\"}]", "cluster.uuid": "vaBZvVfWRhC-9gp74gBftw", "node.id": "Y8GqxVMSTIWMcHb13Epzqg" , 
"stacktrace": ["java.lang.IllegalStateException: unable to parse steps for policy [filebeat] as it doesn't exist",
"at org.elasticsearch.xpack.ilm.PolicyStepsRegistry.parseStepsFromPhase(PolicyStepsRegistry.java:146) ~[x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.PolicyStepsRegistry.getStep(PolicyStepsRegistry.java:203) ~[x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.onErrorMaybeRetryFailedStep(IndexLifecycleRunner.java:204) ~[x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:155) ~[x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:327) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:265) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:183) [x-pack-core-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:216) [x-pack-core-7.10.2.jar:7.10.2]",
"at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]",
"at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]",
"at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
{"type": "server", "timestamp": "2021-02-09T12:13:59,392Z", "level": "ERROR", "component": "o.e.x.i.IndexLifecycleRunner", "cluster.name": "docker-cluster", "node.name": "cc118b7d73ff", "message": "policy [slm-history-ilm-policy] for index [.slm-history-2-000003] failed on step [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]. Moving to ERROR step", "cluster.uuid": "vaBZvVfWRhC-9gp74gBftw", "node.id": "Y8GqxVMSTIWMcHb13Epzqg" , 
"stacktrace": ["java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [.slm-history-2] does not point to index [.slm-history-2-000003]",
"at org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:114) [x-pack-core-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:174) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:327) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:265) [x-pack-ilm-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:183) [x-pack-core-7.10.2.jar:7.10.2]",
"at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:216) [x-pack-core-7.10.2.jar:7.10.2]",
"at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]",
"at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]",
"at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }