Single SLM Policy Intermittently Generating Concurrency Exception

We have created a Snapshot Lifecycle Management policy on our development cluster using Elasticsearch 7.5.0 to:

  1. Create a snapshot every 12 minutes
  2. Remove snapshots after 3 days

Most of the time the policy works as intended, however, occasionally we are getting an “org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException”.

The odd thing is that there is only one SLM policy for the cluster and the longest duration for a snapshot is 20 seconds. In other words, there should only be one snapshot request every 12 minutes and the snapshots are completing fast enough that there should be no overlap between policy executions.

The following is an excerpt from the elasticsearch log which suggests that the snapshot policy is being triggered twice:

[2020-01-06T14:23:59,996][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy [local_rolling_policy] issuing create snapshot [local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw]
[2020-01-06T14:23:59,996][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy job [local_rolling_policy-1] issued new snapshot creation for [local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw] successfully
[2020-01-06T14:24:00,001][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy [local_rolling_policy] issuing create snapshot [local_rolling_20200106_142400-griorh8qqgatp_3ehuipog]
[2020-01-06T14:24:00,002][INFO ][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] snapshot lifecycle policy job [local_rolling_policy-1] issued new snapshot creation for [local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] successfully
[2020-01-06T14:24:00,016][INFO ][o.e.s.SnapshotsService ] [SERVERNAME-NODE01] snapshot [local_rolling:local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw/kNMIVMlWRQ-vAILGhi9pjg] started
[2020-01-06T14:24:00,015][WARN ][o.e.s.SnapshotsService ] [SERVERNAME-NODE01] [local_rolling][local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] failed to create snapshot
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [local_rolling:local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] a snapshot is already running
at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:304) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.5.0.jar:7.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-01-06T14:24:00,016][ERROR][o.e.x.s.SnapshotLifecycleTask] [SERVERNAME-NODE01] failed to create snapshot for snapshot lifecycle policy [local_rolling_policy]: ConcurrentSnapshotExecutionException[[local_rolling:local_rolling_20200106_142400-griorh8qqgatp_3ehuipog] a snapshot is already running]
[2020-01-06T14:24:00,881][INFO ][o.e.s.SnapshotsService ] [SERVERNAME-NODE01] snapshot [local_rolling:local_rolling_20200106_142359-mpx_njmpskklgpcpxsfxbw/kNMIVMlWRQ-vAILGhi9pjg] completed with state [SUCCESS]

We would appreciate any help in resolving the problem. Also, if this needs to be posted elsewhere feel free to let me know.

Test Environment:

We are seeing the exceptions on our development machines which are running a single node cluster on Windows 10 Enterprise (Version 1903) with OpenJDK 11.0.2, Elasticsearch 7.5.0 and Kibana 7.5.0. Snapshots are being created to local disk.

Steps to Reproduce:

  1. Create the repository
PUT /_snapshot/local_rolling
{
  "type": "fs",
  "settings": {
    "location": "C:\\Elasticsearch\\backup\\local_rolling"
  }
}
  1. Create the Snapshot Lifecycle Management Policy
PUT /_slm/policy/local_rolling_policy
{
  "schedule": "0 */12 * ? * *",
  "name": "<local_rolling_{now/s{yyyyMMdd_HHmmss|America/Toronto}}>",
  "repository": "local_rolling",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": false,
    "include_global_state": true
  },
  "retention": {
        "expire_after": "3d"
  }
} 
  1. Ensure that SLM Policy has been started

The problem is intermittent; however, we typically have multiple instances of the exception each day.

The exception messages can be seen by:

  1. Checking the Elasticsearch log files
  2. Checking the “last_failure” property on the SLM policy

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.