Understanding IndexShardSnapshotFailedException

I have several snapshot policies. Most of them execute successfully and create their snapshots, but sometimes a shard fails with the message "INTERNAL_SERVER_ERROR: Aborted". When I look at the snapshot policy history I get the details below, but I still can't figure out why the snapshot was aborted:

{
  "type": "snapshot_exception",
  "reason": "[repo:logs-2021.01.13-sh6k2jw2qn2kucp2sy_ryw] failed to create snapshot successfully, 1 out of 16 total shards failed",
  "stack_trace": "SnapshotException[[repo:logs-2021.01.13-sh6k2jw2qn2kucp2sy_ryw] failed to create snapshot successfully, 1 out of 16 total shards failed]\n\tat org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:109)\n\tat org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:91)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)\n\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:89)\n\tat org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163)\n\tat org.elasticsearch.action.ActionListener$4.onResponse(ActionListener.java:163)\n\tat org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:212)\n\tat org.elasticsearch.snapshots.SnapshotsService.completeListenersIgnoringException(SnapshotsService.java:2264)\n\tat org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$11(SnapshotsService.java:1253)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$finalizeSnapshot$37(BlobStoreRepository.java:1049)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)\n\tat org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)\n\tat org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710)\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\n\tat java.base/java.lang.Thread.run(Thread.java:832)\n\tSuppressed: [logs-2021.01.13-000085/2Ih_m4ThT8WZ_wSjdyuLjw][[logs-2021.01.13-000085][6]] IndexShardSnapshotFailedException[aborted]\n\t\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:77)\n\t\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:65)\n\t\tat org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1229)\n\t\tat org.elasticsearch.snapshots.SnapshotsService.access$2000(SnapshotsService.java:120)\n\t\tat org.elasticsearch.snapshots.SnapshotsService$5.onResponse(SnapshotsService.java:1177)\n\t\tat org.elasticsearch.snapshots.SnapshotsService$5.onResponse(SnapshotsService.java:1174)\n\t\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1241)\n\t\tat org.elasticsearch.snapshots.SnapshotsService.endSnapshot(SnapshotsService.java:1174)\n\t\tat org.elasticsearch.snapshots.SnapshotsService.access$1000(SnapshotsService.java:120)\n\t\tat org.elasticsearch.snapshots.SnapshotsService$14.clusterStateProcessed(SnapshotsService.java:2551)\n\t\tat org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:534)\n\t\tat 
org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:421)\n\t\tat java.base/java.util.ArrayList.forEach(ArrayList.java:1510)\n\t\tat org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:421)\n\t\tat org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:281)\n\t\tat org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:273)\n\t\tat org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250)\n\t\tat org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)\n\t\tat org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)\n\t\tat org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)\n\t\tat org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)\n\t\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651)\n\t\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)\n\t\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)\n\t\t... 3 more\n",
  "suppressed": [
    {
      "type": "index_shard_snapshot_failed_exception",
      "reason": "aborted",
      "index_uuid": "2Ih_m4ThT8WZ_wSjdyuLjw",
      "shard": "6",
      "index": "logs-2021.01.13-000085",
      "stack_trace": "[logs-2021.01.13-000085/2Ih_m4ThT8WZ_wSjdyuLjw][[logs-2021.01.13-000085][6]] IndexShardSnapshotFailedException[aborted]\n\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:77)\n\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:65)\n\tat org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1229)\n\tat org.elasticsearch.snapshots.SnapshotsService.access$2000(SnapshotsService.java:120)\n\tat org.elasticsearch.snapshots.SnapshotsService$5.onResponse(SnapshotsService.java:1177)\n\tat org.elasticsearch.snapshots.SnapshotsService$5.onResponse(SnapshotsService.java:1174)\n\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1241)\n\tat org.elasticsearch.snapshots.SnapshotsService.endSnapshot(SnapshotsService.java:1174)\n\tat org.elasticsearch.snapshots.SnapshotsService.access$1000(SnapshotsService.java:120)\n\tat org.elasticsearch.snapshots.SnapshotsService$14.clusterStateProcessed(SnapshotsService.java:2551)\n\tat org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:534)\n\tat org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:421)\n\tat java.base/java.util.ArrayList.forEach(ArrayList.java:1510)\n\tat org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:421)\n\tat org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:281)\n\tat org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:273)\n\tat org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250)\n\tat org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)\n\tat org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)\n\tat org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)\n\tat org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\n\tat java.base/java.lang.Thread.run(Thread.java:832)\n"
    }
  ]
}

I'm currently using Elasticsearch 7.9.1, and at the time this snapshot was taken there were no other snapshots executing.
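
For reference, this is roughly how I pull the policy history shown above. It is only a sketch: the cluster URL, the use of the Python requests client, and the policy id "daily-logs" are placeholders, not my real setup.

```python
# Sketch: fetch an SLM policy and print its last failure details.
# ES_URL and POLICY_ID are placeholders, not the real values.
import json
import requests

ES_URL = "http://localhost:9200"   # assumption: adjust to your cluster
POLICY_ID = "daily-logs"           # assumption: hypothetical policy id

# GET _slm/policy/<id> returns the policy plus its last_success / last_failure info
resp = requests.get(f"{ES_URL}/_slm/policy/{POLICY_ID}")
resp.raise_for_status()
policy = resp.json()[POLICY_ID]

# The shard-level error shown above is what I see under the last_failure block
print(json.dumps(policy.get("last_failure", {}), indent=2))
```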

Thank you

Hi @Kambing

The abort will have come either from a manual delete request or from SLM deleting the snapshot before it completed. You should be able to figure out which of the two happened here by checking your logs from the time the shard snapshot failed.
SLM logs both that it is doing a retention run and which snapshots it deletes, via the logging in the class org.elasticsearch.xpack.slm.SnapshotRetentionTask. If you can't see SLM executing the delete, then it must have come from a manual snapshot delete request.
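
If it helps, here is a rough sketch of those two checks. The cluster URL, the log file path, and the exact field names in the _slm/stats response are assumptions you will need to adapt to your deployment.

```python
# Sketch: check whether SLM retention has been deleting snapshots, and scan the
# server log for retention activity. URL and log path are assumptions.
import requests

ES_URL = "http://localhost:9200"                            # assumption
LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"       # assumption

# 1. GET _slm/stats reports retention activity; .get() is used because the
#    field names here are from memory and may differ slightly by version.
stats = requests.get(f"{ES_URL}/_slm/stats").json()
print("retention runs:", stats.get("retention_runs"))
print("snapshots deleted by retention:", stats.get("total_snapshots_deleted"))

# 2. Scan the server log for retention activity around the failure time; the
#    retention task logs under org.elasticsearch.xpack.slm.SnapshotRetentionTask.
with open(LOG_PATH) as log:
    for line in log:
        if "SnapshotRetentionTask" in line:
            print(line.rstrip())
```

If SnapshotRetentionTask shows up deleting that snapshot around the failure time, the abort came from retention; if not, look for a manual snapshot delete (e.g. a DELETE _snapshot/&lt;repo&gt;/&lt;snapshot&gt; request) in the same window.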
