Snapshot doesn't complete with "failed to finalize snapshot"

Hi, I'm having trouble getting a snapshot of my cluster to complete.
For context, this cluster had nightly snapshots working as expected until late November 2023. Then a misconfiguration in Beats caused a spike in data ingestion, which increased the cluster's disk usage considerably; today it is edging towards 85%.

Around the time the snapshots started failing, the following warning was logged repeatedly:

{"type": "server", "timestamp": "2024-01-10T15:35:59,482Z", "level": "WARN", "component": "o.e.r.b.BlobStoreRepository", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "Your repository metadata blob for repository [redacted] is larger than 5MB. Consider moving to a fresh repository for new snapshots or deleting unneeded snapshots from your repository to ensure stable repository behavior going forward.", "cluster.uuid": "redacted", "node.id": "redacted"  }

Due to this message we set up a new S3 bucket and updated the config so that ES would create snapshots again.
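For reference, registering the fresh S3 repository looked roughly like this (the repository and bucket names below are placeholders, not our actual ones):

```
PUT _snapshot/new-snapshot-repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-new-snapshots-bucket",
    "base_path": "snapshots"
  }
}
```

After registering it, we pointed the SLM policy's `repository` field at the new repository name.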

The behaviour we're seeing now: the snapshot creation process runs for about 4 days, then stalls for a very long time on the last shard and finally fails, logging a stack trace that didn't give me much indication of the cause:
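For anyone following along, progress of an in-flight snapshot can be checked with the snapshot status APIs (repository name below is a placeholder):

```
GET _snapshot/new-snapshot-repo/_current
GET _snapshot/new-snapshot-repo/daily-snapshot-*/_status
```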

{"type": "server", "timestamp": "2024-02-02T12:24:45,786Z", "level": "WARN", "component": "o.e.s.SnapshotsService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "[redacted:daily-snapshot-2024.01.29-spyjdkjbsi25jfkoetligg/MNCXSIcRT16ZQaaLKik0GA] failed to finalize snapshot", "cluster.uuid": "redacted", "node.id": "redacted" , 
"stacktrace": ["java.lang.NullPointerException: null",
"at java.util.Objects.requireNonNull(Objects.java:208) ~[?:?]",
"at java.util.ImmutableCollections$ListN.<init>(ImmutableCollections.java:530) ~[?:?]",
"at java.util.List.of(List.java:1045) ~[?:?]",
"at java.util.ImmutableCollections.listCopy(ImmutableCollections.java:148) ~[?:?]",
"at java.util.List.copyOf(List.java:1066) ~[?:?]",
"at org.elasticsearch.common.collect.List.copyOf(List.java:79) ~[elasticsearch-core-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.metadata.IndexAbstraction$DataStream.<init>(IndexAbstraction.java:310) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.metadata.Metadata$Builder.buildIndicesLookup(Metadata.java:1500) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.metadata.Metadata$Builder.build(Metadata.java:1467) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.metadataForSnapshot(SnapshotsService.java:1017) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$36(SnapshotsService.java:1552) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:68) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.action.StepListener.whenComplete(StepListener.java:78) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1549) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:127) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1468) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1465) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1310) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.endSnapshot(SnapshotsService.java:1465) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.access$900(SnapshotsService.java:127) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService$16.clusterStateProcessed(SnapshotsService.java:3105) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:534) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:421) [elasticsearch-7.10.2.jar:7.10.2]",
"at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]",
"at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:421) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:281) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:273) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.10.2.jar:7.10.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
{"type": "server", "timestamp": "2024-02-02T12:24:45,928Z", "level": "ERROR", "component": "o.e.x.s.SnapshotLifecycleTask", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "failed to create snapshot for snapshot lifecycle policy [daily-snapshot]: java.lang.NullPointerException", "cluster.uuid": "redacted", "node.id": "redacted"  }

Yes, we're aware 7.10.2 is EOL. We plan to upgrade to 7.17, but having a successful snapshot is one of the prerequisites for that.

Also, this is a massive cluster holding a large amount of data at the moment, and changing the retention policy is itself pending a successful snapshot... so we're a bit in a catch-22 situation.

I appreciate any suggestions to get this initial snapshot working.
Thanks

To help anyone in a similar position in the future: since the cluster was so large, and the initial snapshot in particular was taking so long, we decided to split our snapshot policy into several smaller runs spread throughout the week, partitioned alphabetically by index name. So instead of one big snapshot of the whole cluster, we now take a couple of smaller ones per night.
Slicing the snapshots into smaller chunks worked fine and solved this issue.
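As a sketch, the split looked something like the following (the policy name, schedule, repository name and index patterns here are illustrative, not our exact values). Instead of one policy covering `*`, several policies each cover a slice of the index namespace on different nights:

```
PUT _slm/policy/nightly-snapshot-a-to-f
{
  "schedule": "0 30 1 * * ?",
  "name": "<nightly-a-f-{now/d}>",
  "repository": "new-snapshot-repo",
  "config": {
    "indices": ["a*", "b*", "c*", "d*", "e*", "f*"]
  }
}
```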


The stack trace suggests there was something unexpected in your cluster metadata in the vicinity of the definition of one or more data streams. Hitting this is definitely a bug (you did nothing wrong), but that's where it suggests the problem lies. I don't remember any bugs in this area off the top of my head, and it has been many years and the code is very different today, so it's not easy to pin down the specific problem any more.

In any case thanks for reporting back that you found a workaround. I imagine that snapshotting subsets of the indices somehow misses out the unexpected metadata. Good luck with the upgrade!