I have a curator action file with the following contents:
---
# Remember, leave a key empty if there is no value. None will be a string,
# not a Python "NoneType"
actions:
  1:
    action: snapshot
    description: Take a snapshot of logstash-* in the whole cluster
    options:
      repository: logstash-backups
      name: 'logstash-backups-%Y%m%d%H%M%S'
      wait_for_completion: True
      # 6 hours of seconds
      max_wait: 21600
      wait_interval: 10
      continue_if_exception: False
    filters:
      # backup everything and let elasticsearch manage deduping
      - filtertype: pattern
        kind: prefix
        value: logstash-
      # # backup only indexes older than 7 days
      # - filtertype: age
      #   source: creation_date
      #   direction: older
      #   unit: days
      #   unit_count: 14
  2:
    action: delete_indices
    description: Delete indices outside of the past 7 days
    options:
      continue_if_exception: False
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      # delete logstash indices created more than 7 days ago.
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 7
      # # delete logstash indices based on total size
      # - filtertype: space
      #   use_age: True
      #   source: creation_date
      #   disk_space: 2000
  3:
    action: delete_snapshots
    description: Delete snapshots greater than 60 days old
    options:
      repository: logstash-backups
      retry_interval: 120
      retry_count: 3
      continue_if_exception: False
      # FIXME Should be removed once we have 60 days of backups. i.e. 60
      # days from 2018-01-05
      ignore_empty_list: True
    filters:
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 60
The delete_snapshots step fails every day with an exception like the following (UTC times):
2018-05-07 09:23:15,250 INFO Preparing Action ID: 3, "delete_snapshots"
2018-05-07 09:23:15,257 INFO Trying Action ID: 3, "delete_snapshots": Delete snapshots greater than 60 days old
2018-05-07 09:23:15,988 INFO Deleting selected snapshots
2018-05-07 09:23:16,793 INFO Deleting snapshot logstash-backups-20180228080001...
2018-05-07 09:24:16,546 ERROR Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(404, 'snapshot_missing_exception', '[logstash-backups:logstash-backups-20180228080001] is missing')
Looking at my monitoring of the snapshot count, we do lose a snapshot right after that 404, and the error is logged almost exactly 60 seconds after the delete starts, which seems to indicate a timeout of some kind. I know that the AWS ELB has a 60-second non-configurable timeout as well.
Any ideas how I can get the job to successfully delete the snapshots? Or how I can go about debugging what's happening? AWS support is not being very helpful.
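For context on the "Rerun with loglevel DEBUG" hint in the error above: in a Curator 5.x style setup, the log level lives in the client configuration file rather than in the action file. The following is only a minimal sketch of such a file; the host, port, and timeout values are placeholders and assumptions, not the settings actually in use here.

```yaml
---
# Hypothetical curator.yml; hosts, port, and timeout are placeholder values.
client:
  hosts:
    - 127.0.0.1
  port: 9200
  # Client-side request timeout in seconds (assumed value).
  timeout: 30
logging:
  # DEBUG logs each request and response, which should show whether the
  # delete is being sent more than once.
  loglevel: DEBUG
  logfile:
  logformat: default
```

A file like this would be passed with `curator --config curator.yml actions.yml`, where `actions.yml` stands for the action file shown above. Adding `--dry-run` would log what each action would do without deleting anything.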