Snapshot exception

pmk · October 13, 2020, 7:48am

Hi

I'm trying to take a snapshot every 30 minutes,
but the following error occurs.
Error:
{
"type": "concurrent_snapshot_execution_exception",
"reason": "[reponm-prd-snapshots:scheduled-lm_rkid_qtoa75fr0qvlca] a snapshot is already running",
"stack_trace": "ConcurrentSnapshotExecutionException[[reponm-prd-snapshots:scheduled-lm_rkid_qtoa75fr0qvlca] a snapshot is already running]\n\tat org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:203)\n\tat org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)\n\tat org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702)\n\tat org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324)\n\tat org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219)\n\tat org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)\n\tat org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)\n\tat org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)\n\tat org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\n\tat java.base/java.lang.Thread.run(Thread.java:832)\n"
}

The setup is as follows:
Elasticsearch 7.8.0
Total nodes: 3 (master/data, master/data, master/data), clustered
{
  "nodes": {
    "sqMr-H3dRd6zqtcVysB-xg": {
      "name": "node-3"
    },
    "EvHTgdmkSF-JnTwhORAV-Q": {
      "name": "node-1"
    },
    "AUi5OleeTiKDGDhtYosJpw": {
      "name": "node-2"
    }
  }
}

I don't know what's wrong.
Please help me!

dadoonet · October 13, 2020, 9:20am

Welcome.

Please read this about how to format.

Here the error message seems to be clear:

a snapshot is already running

You should probably check before running the new snapshot that the previous one has finished. Or just "ignore" the error message and try again 30 minutes later.
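
The "check first" approach could look roughly like the sketch below. It assumes the cluster is reachable over plain HTTP at a hypothetical `base_url` and relies on the `GET _snapshot/<repository>/_current` API, which lists the snapshots currently in progress. The function names and the injectable `fetch` parameter are illustrative only and are not part of any Elasticsearch client.

```python
import json
import urllib.request


def running_snapshots(base_url, repository, fetch=None):
    """Return the snapshots currently in progress in `repository`.

    `fetch` is a hypothetical hook (URL -> parsed JSON) so the function
    can be exercised without a live cluster.
    """
    url = f"{base_url}/_snapshot/{repository}/_current"
    if fetch is None:
        def fetch(u):
            # Plain HTTP GET; authentication and TLS are left out on purpose.
            with urllib.request.urlopen(u) as resp:
                return json.load(resp)
    return fetch(url).get("snapshots", [])


def safe_to_start(base_url, repository, fetch=None):
    """True when no snapshot is reported as running in the repository."""
    return len(running_snapshots(base_url, repository, fetch)) == 0
```

The check and the later snapshot request are not atomic, so the exception can still occur between the two calls and should still be handled.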

pmk · October 14, 2020, 1:05am

The snapshot is executed by a schedule, and that is what produces the error.
The same problem occurs even if the time interval is increased.
Snapshots are executed only by the schedule.

dadoonet · October 14, 2020, 5:00am

I don't understand. Could you clarify?

pmk · October 14, 2020, 6:28am

I registered the schedule in Snapshot and Restore,
and the snapshot simply runs according to that schedule.
But I get the same error as above.

Christian_Dahlqvist · October 14, 2020, 6:41am

If you want to run the snapshot every 30 minutes, should the cron schedule not look something like this 0,30 * * * * ? Note that if the snapshot does not complete within 30 minutes you are likely to see the same problem with this schedule.
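
One thing worth checking when writing the schedule: Elasticsearch cron expressions start with a seconds field, so the position of `0,30` matters. The sketch below is a hypothetical helper, not Elasticsearch's own parser. It expands only the seconds and minutes fields of an expression to show when it would fire within one hour, assuming the six-field layout (seconds, minutes, hours, day of month, month, day of week).

```python
def expand_field(field, low, high):
    """Expand one cron field; supports '*', 'a,b', 'a-b', 'a/step', '*/step'."""
    values = set()
    for part in field.split(","):
        if "/" in part:
            start, step = part.split("/")
            first = low if start == "*" else int(start)
            values.update(range(first, high + 1, int(step)))
        elif part == "*":
            values.update(range(low, high + 1))
        elif "-" in part:
            a, b = part.split("-")
            values.update(range(int(a), int(b) + 1))
        else:
            values.add(int(part))
    return sorted(values)


def fire_times_per_hour(expression):
    """Return (minute, second) pairs at which the expression fires in an hour."""
    seconds, minutes = expression.split()[:2]
    return [
        (m, s)
        for m in expand_field(minutes, 0, 59)
        for s in expand_field(seconds, 0, 59)
    ]
```

Under that assumption, `0 0,30 * * * ?` and `0 0/30 * * * ?` both fire twice an hour, while an expression with `0,30` in the first position fires every 30 seconds.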

Armin_Braun · October 14, 2020, 7:35am

Hi @pmk

one thing to add to what @Christian_Dahlqvist points out

Note that if the snapshot does not complete within 30 minutes you are likely to see the same problem with this schedule.

This is a non-issue if you upgrade to v7.9 or later. We support fully concurrent snapshot operations from that version on. See:

github.com/elastic/elasticsearch

Enable Fully Concurrent Snapshot Operations

elastic:master ← original-brownbear:allow-multiple-snapshots

opened 01:43PM - 18 May 20 UTC

original-brownbear

+2867 -471

Enables fully concurrent snapshot operations:

* Snapshot create- and delete operations can be started in any order
* Delete operations wait for snapshot finalization to finish, are batched as much as possible to improve efficiency and once enqueued in the cluster state prevent new snapshots from starting on data nodes until executed
* We could be even more concurrent here in a follow-up by interleaving deletes and snapshots on a per-shard level. I decided not to do this for now since it seemed not worth the added complexity yet. Due to batching+deduplicating of deletes the pain of having a delete stuck behind a long-running snapshot seemed manageable (dropped client connections + resulting retries don't cause issues due to deduplication of delete jobs, batching of deletes allows enqueuing more and more deletes even if a snapshot blocks for a long time that will all be executed in essentially constant time (due to bulk snapshot deletion, deleting multiple snapshots is mostly about as fast as deleting a single one))
* Snapshot creation is completely concurrent across shards, but per shard snapshots are linearized for each repository as are snapshot finalizations

See updated JavaDoc and added test cases for more details and illustration on the functionality.

Some notes:

The queuing of snapshot finalizations and deletes and the related locking/synchronization is a little awkward in this version but can be much simplified with some refactoring. The problem is that snapshot finalizations resolve their listeners on the `SNAPSHOT` pool while deletes resolve the listener on the master update thread. With some refactoring both of these could be moved to the master update thread, effectively removing the need for any synchronization around the `SnapshotService` state. I didn't do this refactoring here because it's a fairly large change and not necessary for the functionality but plan to do so in a follow-up.

This change allows for completely removing any trickery around synchronizing deletes and snapshots from SLM and 100% does away with SLM errors from collisions between deletes and snapshots. Snapshotting a single index in parallel to a long running full backup will execute without having to wait for the long running backup as required by the ILM/SLM use case of moving indices to "snapshot tier". Finalizations are linearized but ordered according to which snapshot saw all of its shards complete first.
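
To tell whether a cluster is already on a version with this change, one could compare the `version.number` field returned by `GET /` against 7.9. A minimal sketch, with hypothetical function names and the root response passed in as an already-parsed dictionary:

```python
def supports_concurrent_snapshots(version):
    """True if `version` (for example '7.8.0') is 7.9 or later."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (7, 9)


def cluster_supports_concurrent_snapshots(root_info):
    """Check the parsed body of `GET /`, which carries version.number."""
    return supports_concurrent_snapshots(root_info["version"]["number"])
```

For the 7.8.0 cluster described in this thread, the check returns False.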

pmk · October 14, 2020, 7:52am

Snapshot duration: 170s
I changed the schedule from every 30 minutes to once a day,
but the error message is still output.

Armin_Braun · October 14, 2020, 7:58am

@pmk

but the error message is still output.

That is strange, but I think there might be a bug here. We've had similar reports before but never managed to reproduce them properly.

Could you paste the logs around the error message, including the part where the SnapshotsService logs the start of the snapshot that actually succeeds, so I can take a look?

Thanks!

pmk · October 14, 2020, 8:02am

setting

Error message:

        {
          "type": "concurrent_snapshot_execution_exception",
          "reason": "[reponm-prd-snapshots:scheduled-v6zsv9rxrhgazmcdhrwlig]  a snapshot is already running",
          "stack_trace": "ConcurrentSnapshotExecutionException[[reponm-prd-snapshots:scheduled-v6zsv9rxrhgazmcdhrwlig]  a snapshot is already running]\n\tat org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:203)\n\tat org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47)\n\tat org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702)\n\tat org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324)\n\tat org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219)\n\tat org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73)\n\tat org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151)\n\tat org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150)\n\tat org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:636)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\n\tat java.base/java.lang.Thread.run(Thread.java:832)\n"
        }

Armin_Braun · October 15, 2020, 8:09am

@pmk sorry, I should have worded this more carefully.

What I'm looking for is the full logs with timestamps that show both the ERROR for failing to start a snapshot and the logs for the concurrent snapshot that prevented the failing one from starting. So everything between a line like this for the running snapshot:

[2020-10-15T10:06:51,149][INFO ][o.e.s.SnapshotsService   ] [node_s0] snapshot [test-repo:test-snap/88ZwRkUERZClvs2_0w4DQA] started

and its corresponding completion log, which looks like this:

[2020-10-15T10:06:51,574][INFO ][o.e.s.SnapshotsService   ] [node_s0] snapshot [test-repo:test-snap/88ZwRkUERZClvs2_0w4DQA] completed with state [SUCCESS]

would be ideal if possible.
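
For anyone who needs to pull that window out of a large log file, a rough sketch follows. It assumes the `SnapshotsService` lines have exactly the shape of the two examples above; the regular expression and the function names are illustrative and may need adjusting for other log layouts.

```python
import re

# Matches the "started" / "completed" lines logged by SnapshotsService,
# based only on the two example lines quoted above.
EVENT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[[^\]]*\]\[o\.e\.s\.SnapshotsService\s*\] "
    r"\[[^\]]+\] snapshot \[(?P<snap>[^\]]+)\] (?P<event>started|completed)"
)


def snapshot_windows(lines):
    """Map each snapshot to the timestamps of its start and completion."""
    windows = {}
    for line in lines:
        m = EVENT.match(line)
        if m:
            windows.setdefault(m["snap"], {})[m["event"]] = m["ts"]
    return windows


def lines_between(lines, snapshot):
    """Return the log lines from the 'started' to the 'completed' entry."""
    selected, inside = [], False
    for line in lines:
        m = EVENT.match(line)
        mine = m is not None and m["snap"] == snapshot
        if mine and m["event"] == "started":
            inside = True
        if inside:
            selected.append(line)
        if mine and m["event"] == "completed":
            break
    return selected
```

Calling `snapshot_windows` on the whole log first shows which snapshots overlap in time. `lines_between` then extracts everything logged while a given snapshot was running, including any ERROR lines in that window.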

Thanks again!

pmk · October 15, 2020, 8:51am

Due to security policy, files cannot be attached.

Armin_Braun · October 15, 2020, 3:41pm

Thanks @pmk

We observed this issue in another context as well today and think this is a bug. We're now tracking the work on it in the issue below.

