Single SLM Policy Intermittently Generating Concurrency Exception

mstephenson · January 7, 2020, 4:26pm

We have created a Snapshot Lifecycle Management policy on our development cluster using Elasticsearch 7.5.0 to:

Create a snapshot every 12 minutes
Remove snapshots after 3 days

Most of the time the policy works as intended, however, occasionally we are getting an “org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException”.

The odd thing is that there is only one SLM policy for the cluster and the longest duration for a snapshot is 20 seconds. In other words, there should only be one snapshot request every 12 minutes and the snapshots are completing fast enough that there should be no overlap between policy executions.

The following is an excerpt from the elasticsearch log which suggests that the snapshot policy is being triggered twice:

[2020-01-06T14:23:59,996][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy [local_rolling_policy] issuing create snapshot [local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw]
[2020-01-06T14:23:59,996][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy job [local_rolling_policy-1] issued new snapshot creation for [local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw] successfully
[2020-01-06T14:24:00,001][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy [local_rolling_policy] issuing create snapshot [local_rolling_20200106_142400-griorh8qqgatp_3ehuipog]
[2020-01-06T14:24:00,002][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy job [local_rolling_policy-1] issued new snapshot creation for [local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] successfully
[2020-01-06T14:24:00,016][INFO ][o.e.s.SnapshotsService ] [SERVERNAME-NODE01] snapshot [local_rolling:local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw/kNMIVMlWRQ-vAILGhi9pjg] started
[2020-01-06T14:24:00,015][WARN ][o.e.s.SnapshotsService ] [SERVERNAME-NODE01] [local_rolling][local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [local_rolling:local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] a snapshot is already running
at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:304) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.5.0.jar:7.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-01-06T14:24:00,016][ERROR][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] failed to create snapshot for snapshot lifecycle policy [local_rolling_policy]: ConcurrentSnapshotExecutionException[[local_rolling:local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] a snapshot is already running]
[2020-01-06T14:24:00,881][INFO ][o.e.s.SnapshotsService ] [SERVERNAME-NODE01] snapshot [local_rolling:local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw/kNMIVMlWRQ-vAILGhi9pjg] completed with state [SUCCESS]

We would appreciate any help in resolving the problem. Also, if this needs to be posted elsewhere feel free to let me know.

Test Environment:

We are seeing the exceptions on our development machines which are running a single node cluster on Windows 10 Enterprise (Version 1903) with OpenJDK 11.0.2, Elasticsearch 7.5.0 and Kibana 7.5.0. Snapshots are being created to local disk.

Steps to Reproduce:

Create the repository

PUT /_snapshot/local_rolling
{
  "type": "fs",
  "settings": {
    "location": "C:\\Elasticsearch\\backup\\local_rolling"
  }
}

Create the Snapshot Lifecycle Management Policy

PUT /_slm/policy/local_rolling_policy
{
  "schedule": "0 */12 * ? * *",
  "name": "<local_rolling_{now/s{yyyyMMdd_HHmmss|America/Toronto}}>",
  "repository": "local_rolling",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": false,
    "include_global_state": true
  },
  "retention": {
        "expire_after": "3d"
  }
}

Ensure that SLM Policy has been started

The problem is intermittent; however, we typically have multiple instances of the exception each day.

The exception messages can be seen by:

Checking the Elasticsearch log files
Checking the “last_failure” property on the SLM policy

system · February 4, 2020, 4:28pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Multiple SLM Policies, same Snapshot Repository and same Elasticsearch Cluster? Elasticsearch slm-snapshot-lifecycle-management , snapshot-and-restore	1	399	February 1, 2023
SLM Policy Not Being Followed? Elasticsearch slm-snapshot-lifecycle-management , snapshot-and-restore	3	425	October 29, 2021
SLM: Only one snapshot can be taken at a time. To avoid snapshot failures, edit or delete the policies Elasticsearch	5	582	June 17, 2021
Kibana stills warns about snapshots with the same Schedule Kibana slm-snapshot-lifecycle-management	3	306	December 29, 2022
SLM policy for elastic snapshot for 30 mins is failing Elasticsearch	1	393	July 14, 2020

Single SLM Policy Intermittently Generating Concurrency Exception

Related topics