Snapshot Retention Task not deleting expired snapshots

For some reason, I cannot get the snapshot retention policy to delete old snapshots in one of my clusters.

I successfully set up SLM on a standalone test node ingesting a regular but small amount of data. The retention task was deleting expired snapshots as expected, with normal-looking logs.
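For reference, the policy on the test node was created with something like the following — the policy name and repository are the ones that show up in the logs, but the schedule, snapshot name, and retention values here are only a rough reconstruction of what I used, not the exact config:

PUT _slm/policy/fast-snapshots
{
  "schedule": "0 0/5 * * * ?",
  "name": "test-snap",
  "repository": "my_backup",
  "config": {
    "include_global_state": false
  },
  "retention": {
    "expire_after": "1h"
  }
}

PUT _cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 0 * * * ?"
  }
}

With that in place, the logs on the test node looked healthy: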

$ grep SnapshotRetentionTask /var/log/elasticsearch/elasticsearch.log -b4
...
1365314:[2020-01-21T16:00:00,002][INFO ][o.e.x.s.SnapshotRetentionTask] [node-1] starting SLM retention snapshot cleanup task
1365432-[2020-01-21T16:00:00,003][INFO ][o.e.x.s.SnapshotLifecycleTask] [node-1] snapshot lifecycle policy [fast-snapshots] issuing create snapshot [test-snap-aaphnvyiqvu680b4no2oqg]
1365607-[2020-01-21T16:00:00,006][INFO ][o.e.x.s.SnapshotLifecycleTask] [node-1] snapshot lifecycle policy job [fast-snapshots-1] issued new snapshot creation for [test-snap-aaphnvyiqvu680b4no2oqg] successfully
1365810-[2020-01-21T16:00:00,094][INFO ][o.e.s.SnapshotsService   ] [node-1] snapshot [my_backup:test-snap-aaphnvyiqvu680b4no2oqg/rfUmLqQDQbmFlreHqYT_aA] started
1365964-[2020-01-21T16:00:01,905][INFO ][o.e.s.SnapshotsService   ] [node-1] snapshot [my_backup:test-snap-aaphnvyiqvu680b4no2oqg/rfUmLqQDQbmFlreHqYT_aA] completed with state [SUCCESS]
1366141:[2020-01-21T16:00:01,938][INFO ][o.e.x.s.SnapshotRetentionTask] [node-1] starting snapshot retention deletion for [10] snapshots
1366270:[2020-01-21T16:00:01,943][INFO ][o.e.x.s.SnapshotRetentionTask] [node-1] [my_backup] snapshot retention deleting snapshot [test-snap-lywc9-ecrcygdlbz9kbkyw/9SGpTm-VR4KITk8sSFjGNQ]
1366450-[2020-01-21T16:00:02,179][INFO ][o.e.r.b.BlobStoreRepository] [node-1] [my_backup] Found stale root level blobs [meta-9SGpTm-VR4KITk8sSFjGNQ.dat, snap-9SGpTm-VR4KITk8sSFjGNQ.dat]. Cleaning them up
1366647-[2020-01-21T16:00:02,940][INFO ][o.e.s.SnapshotsService   ] [node-1] snapshot [my_backup:test-snap-lywc9-ecrcygdlbz9kbkyw/9SGpTm-VR4KITk8sSFjGNQ] deleted
1366801:[2020-01-21T16:00:02,967][INFO ][o.e.x.s.SnapshotRetentionTask] [node-1] [my_backup] snapshot retention deleting snapshot [test-snap-4st8t8xkt3u_y3bmlbdr6a/-GFKixKCQWaXup3KUA_3eQ]
...

If I run GET _slm/stats, the response shows that retention is running and deleting snapshots as expected:

{
  "retention_runs" : 158,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "20.8m",
  "retention_deletion_time_millis" : 1250628,
  "total_snapshots_taken" : 1572,
  "total_snapshots_failed" : 1,
  "total_snapshots_deleted" : 1553,
  "total_snapshot_deletion_failures" : 0,
  "policy_stats" : [
    {
      "policy" : "fast-snapshots",
      "snapshots_taken" : 1572,
      "snapshots_failed" : 1,
      "snapshots_deleted" : 1553,
      "snapshot_deletion_failures" : 0
    }
  ]
}

However, when I set up the same policy in a 3-node cluster with slightly more data, snapshots were created but never deleted. The logs look like this:

588069:[2020-01-21T13:30:00,001][INFO ][o.e.x.s.SnapshotRetentionTask] [sbxelkes-p11b] starting SLM retention snapshot cleanup task
588194-[2020-01-21T13:30:07,147][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
588310-[2020-01-21T13:30:17,081][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
588426-[2020-01-21T13:30:27,288][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
588542-[2020-01-21T13:30:37,229][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
--
629872-[2020-01-21T14:29:47,805][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
629988-[2020-01-21T14:29:57,975][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
630104-[2020-01-21T14:30:00,039][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [sbxelkes-p11b] uncaught exception in thread [elasticsearch[sbxelkes-p11b][generic][T#3]]
630272-org.elasticsearch.ElasticsearchException: java.lang.IllegalStateException: slm retention snapshot deletion out while waiting for ongoing snapshot operations to complete
630441: at org.elasticsearch.xpack.slm.SnapshotRetentionTask.lambda$maybeDeleteSnapshots$15(SnapshotRetentionTask.java:328) ~[?:?]
630565: at org.elasticsearch.xpack.slm.SnapshotRetentionTask$NoSnapshotRunningListener.onTimeout(SnapshotRetentionTask.java:515) ~[?:?]
630694- at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) ~[elasticsearch-7.5.1.jar:7.5.1]
630845- at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) ~[elasticsearch-7.5.1.jar:7.5.1]
630999- at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598) ~[elasticsearch-7.5.1.jar:7.5.1]
631142- at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.5.1.jar:7.5.1]
--
673130-[2020-01-21T15:29:28,440][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
673246-[2020-01-21T15:29:38,178][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
673362-[2020-01-21T15:29:48,125][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
673478-[2020-01-21T15:29:58,300][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [sbxelkes-p11b] Get stats for datafeed '_all'
673594:[2020-01-21T15:30:00,000][INFO ][o.e.x.s.SnapshotRetentionTask] [sbxelkes-p11b] starting SLM retention snapshot cleanup task
...
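The interesting part is the uncaught IllegalStateException: the retention task apparently timed out waiting for ongoing snapshot operations to complete. For anyone debugging the same thing, these are generic ways to check whether the cluster really has a snapshot or snapshot deletion running when retention fires (nothing specific to my setup):

GET _snapshot/_status

GET _tasks?actions=*snapshot*&detailed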

Originally I believed the problem was that I was taking snapshots too quickly, leaving no window for any deletions to run. So I changed the SLM schedule to hourly (0 0 * * * ?) with an expiry of 6h, and the retention schedule to 0 30 * * * ? (a sketch of the updated config is at the end of this post). Since none of the snapshots took longer than 5 minutes to create, no snapshot creation should have been running while the retention task ran, yet GET _slm/stats shows that no snapshots have been deleted:

{
  "retention_runs" : 103,
  "retention_failed" : 0,
  "retention_timed_out" : 0,
  "retention_deletion_time" : "0s",
  "retention_deletion_time_millis" : 0,
  "total_snapshots_taken" : 21,
  "total_snapshots_failed" : 1,
  "total_snapshots_deleted" : 0,
  "total_snapshot_deletion_failures" : 0,
  "policy_stats" : [
    {
      "policy" : "fast-snapshots",
      "snapshots_taken" : 21,
      "snapshots_failed" : 1,
      "snapshots_deleted" : 0,
      "snapshot_deletion_failures" : 0
    }
  ]
}

Meanwhile, _cat/snapshots/sbx-es-7 clearly shows hourly snapshots well past their expiry:

test-snap-fssc87zgtqkqdfiopkeljq SUCCESS 1579550399 19:59:59 1579550666 20:04:26  4.4m 82 123 0 123
test-snap-taurpr7rrd2jjxtiivoc6g SUCCESS 1579553999 20:59:59 1579554027 21:00:27   27s 82 123 0 123
test-snap-cfszpk8ctm2jce43tzm2aw SUCCESS 1579557599 21:59:59 1579557627 22:00:27 27.4s 82 123 0 123
test-snap-re5chymnrws6ah4ziiwvbg SUCCESS 1579561199 22:59:59 1579561229 23:00:29 29.1s 80 121 0 121
test-snap-miyrv-7eqiqgh5y4inxbma SUCCESS 1579564799 23:59:59 1579564834 00:00:34 34.4s 80 121 0 121
test-snap-xs6tcg6gsdyv7xxbgaorhg SUCCESS 1579568399 00:59:59 1579568421 01:00:21 21.9s 85 126 0 126
test-snap-tdavsajiqn6vnzafmy2dnw SUCCESS 1579571999 01:59:59 1579572020 02:00:20 20.9s 85 126 0 126
test-snap-j_2qetm6tlm3cxymxsq7ww SUCCESS 1579575599 02:59:59 1579575619 03:00:19 19.7s 85 126 0 126
test-snap-uxi-lwsuskwz7ryaqck8ng SUCCESS 1579579199 03:59:59 1579579222 04:00:22 23.1s 85 126 0 126
test-snap-4koycujkslizvcx4x7vptg SUCCESS 1579582800 05:00:00 1579582824 05:00:24 24.1s 85 126 0 126
test-snap-lzz9onjeqooqpzopr-74uw SUCCESS 1579586400 06:00:00 1579586422 06:00:22 21.5s 77 118 0 118
test-snap-6xyqtazzsf-ejjwoszbj0a SUCCESS 1579590000 07:00:00 1579590025 07:00:25 25.7s 77 118 0 118
test-snap-wp6clxi2rk-t3vngl5ynig SUCCESS 1579593600 08:00:00 1579593622 08:00:22 22.9s 77 118 0 118
test-snap-ee2bjiunrd6yggf5usc4pa SUCCESS 1579597199 08:59:59 1579597223 09:00:23 23.3s 77 118 0 118
test-snap-62_zf-d9rdqrhbfb2rx_dw SUCCESS 1579600799 09:59:59 1579600821 10:00:21 21.9s 77 118 0 118
test-snap-si8nptl9sb-bg7_eqg_xda SUCCESS 1579604400 11:00:00 1579604421 11:00:21 21.8s 77 118 0 118
test-snap-fchx6nf8rgulgl4x3ppbbq SUCCESS 1579607999 11:59:59 1579608023 12:00:23 23.8s 77 118 0 118
test-snap-neihu2tyqj2lglnzgur27q SUCCESS 1579611599 12:59:59 1579611622 13:00:22 22.7s 77 118 0 118
test-snap-xtfinmrsq_mrnekjlijbha SUCCESS 1579615200 14:00:00 1579615239 14:00:39 39.3s 77 118 0 118
test-snap-i5gjidxgqskkjrxjfceoda SUCCESS 1579622399 15:59:59 1579622428 16:00:28 28.3s 77 118 0 118
test-snap-jsmjngbwqzsobqqmd1otda SUCCESS 1579625999 16:59:59 1579626024 17:00:24 24.6s 77 118 0 118
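For completeness, this is roughly what the updated policy and retention schedule look like after the change described above (reconstructed from memory, so treat the config block as an approximation):

PUT _slm/policy/fast-snapshots
{
  "schedule": "0 0 * * * ?",
  "name": "test-snap",
  "repository": "sbx-es-7",
  "config": {
    "include_global_state": false
  },
  "retention": {
    "expire_after": "6h"
  }
}

PUT _cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 * * * ?"
  }
}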

Same issue for me. Retention ran successfully a couple of times, but when I checked there were two days' worth of snapshots that should already have been deleted.

If I run retention manually, nothing happens. I'm facing this issue in multiple clusters. In our DEV cluster I did a full Elasticsearch restart on all nodes and got it working again, but I believe this is a bug.
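To be specific, by "running retention manually" I mean the execute-retention API, followed by checking the stats and the policy status again (policy name taken from the original post):

POST _slm/_execute_retention

GET _slm/stats

GET _slm/policy/fast-snapshots?human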

ES version: 7.5.1

Looks like the 7.5.2 release fixes this bug via https://github.com/elastic/elasticsearch/pull/50868

Yep, 7.5.2 fixed it. I forgot to update this post after I saw the release notes.
