Hi, I'm facing issues to have a snapshot of my cluster completed.
For context, this cluster used to have nightly snapshots working as expected until late November 2023, however due to an incorrect setting on beats there's been a spike on data input which had the cluster disk usage increase a lot. Today it is edging 85%.
At the time the snapshots started failing the following warning was spotted repeatedly
{"type": "server", "timestamp": "2024-01-10T15:35:59,482Z", "level": "WARN", "component": "o.e.r.b.BlobStoreRepository", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "Your repository metadata blob for repository [redacted] is larger than 5MB. Consider moving to a fresh repository for new snapshots or deleting unneeded snapshots from your repository to ensure stable repository behavior going forward.", "cluster.uuid": "redacted", "node.id": "redacted" }
Due to this message we set up a new S3 bucket and updated the config so that ES would create snapshots again.
The behaviour we're seeing now is the snapshot creation process going forward for about 4 days and when it gets to the last shard it takes a really long time and it fails logging a stack trace that didn't give me much more indication of what could be the cause
{"type": "server", "timestamp": "2024-02-02T12:24:45,786Z", "level": "WARN", "component": "o.e.s.SnapshotsService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "[redacted:daily-snapshot-2024.01.29-spyjdkjbsi25jfkoetligg/MNCXSIcRT16ZQaaLKik0GA] failed to finalize snapshot", "cluster.uuid": "redacted", "node.id": "redacted" ,
"stacktrace": ["java.lang.NullPointerException: null",
"at java.util.Objects.requireNonNull(Objects.java:208) ~[?:?]",
"at java.util.ImmutableCollections$ListN.<init>(ImmutableCollections.java:530) ~[?:?]",
"at java.util.List.of(List.java:1045) ~[?:?]",
"at java.util.ImmutableCollections.listCopy(ImmutableCollections.java:148) ~[?:?]",
"at java.util.List.copyOf(List.java:1066) ~[?:?]",
"at org.elasticsearch.common.collect.List.copyOf(List.java:79) ~[elasticsearch-core-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.metadata.IndexAbstraction$DataStream.<init>(IndexAbstraction.java:310) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.metadata.Metadata$Builder.buildIndicesLookup(Metadata.java:1500) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.metadata.Metadata$Builder.build(Metadata.java:1467) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.metadataForSnapshot(SnapshotsService.java:1017) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$36(SnapshotsService.java:1552) ~[elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:68) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.action.StepListener.whenComplete(StepListener.java:78) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1549) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:127) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1468) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1465) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1310) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.endSnapshot(SnapshotsService.java:1465) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService.access$900(SnapshotsService.java:127) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.snapshots.SnapshotsService$16.clusterStateProcessed(SnapshotsService.java:3105) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:534) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:421) [elasticsearch-7.10.2.jar:7.10.2]",
"at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]",
"at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:421) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:281) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:273) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.10.2.jar:7.10.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
{"type": "server", "timestamp": "2024-02-02T12:24:45,928Z", "level": "ERROR", "component": "o.e.x.s.SnapshotLifecycleTask", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "failed to create snapshot for snapshot lifecycle policy [daily-snapshot]: java.lang.NullPointerException", "cluster.uuid": "redacted", "node.id": "redacted" }
Yes, we're aware 7.10.2 is EOL - we plan to upgrade to 7.17 but having a snapshot is one of the requirements for it.
Also, this is a massive cluster keeping a large amount of data in memory at the moment, pending a successful snapshot in order to change this retention... so we're a bit in a catch 22 situation.
I appreciate any suggestions to get this initial snapshot working.
Thanks