I have a curator action file with the following contents:
    ---
    # Remember, leave a key empty if there is no value. None will be a string,
    # not a Python "NoneType"
    actions:
      1:
        action: snapshot
        description: Take a snapshot of logstash-* in the whole cluster
        options:
          repository: logstash-backups
          name: 'logstash-backups-%Y%m%d%H%M%S'
          wait_for_completion: True
          # 6 hours of seconds
          max_wait: 21600
          wait_interval: 10
          continue_if_exception: False
        filters:
          # backup everything and let elasticsearch manage deduping
          - filtertype: pattern
            kind: prefix
            value: logstash-
          # # backup only indexes older than 7 days
          # - filtertype: age
          #   source: creation_date
          #   direction: older
          #   unit: days
          #   unit_count: 14
      2:
        action: delete_indices
        description: Delete indices outside of the past 7 days
        options:
          continue_if_exception: False
          ignore_empty_list: True
        filters:
          - filtertype: pattern
            kind: prefix
            value: logstash-
          # delete logstash indices created more than 7 days ago.
          - filtertype: age
            source: creation_date
            direction: older
            unit: days
            unit_count: 7
          # # delete logstash indices based on total size
          # - filtertype: space
          #   use_age: True
          #   source: creation_date
          #   disk_space: 2000
      3:
        action: delete_snapshots
        description: Delete snapshots greater than 60 days old
        options:
          repository: logstash-backups
          retry_interval: 120
          retry_count: 3
          continue_if_exception: False
          # FIXME Should be removed once we have 60 days of backups. i.e. 60
          # days from 2018-01-05
          ignore_empty_list: True
        filters:
          - filtertype: age
            source: creation_date
            direction: older
            unit: days
            unit_count: 60
The delete_snapshots step fails every day with an exception like the following (times are in UTC):
    2018-05-07 09:23:15,250 INFO Preparing Action ID: 3, "delete_snapshots"
    2018-05-07 09:23:15,257 INFO Trying Action ID: 3, "delete_snapshots": Delete snapshots greater than 60 days old
    2018-05-07 09:23:15,988 INFO Deleting selected snapshots
    2018-05-07 09:23:16,793 INFO Deleting snapshot logstash-backups-20180228080001...
    2018-05-07 09:24:16,546 ERROR Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(404, 'snapshot_missing_exception', '[logstash-backups:logstash-backups-20180228080001] is missing')
My monitoring of the snapshot count shows that we do lose a snapshot right after that 404, which suggests the delete itself goes through and the client is hitting a timeout of some kind. I know that the AWS ELB has a 60-second non-configurable timeout as well.
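To back up the timeout suspicion, here is a quick check of the gap between the "Deleting snapshot" line and the ERROR line. The two timestamps are copied from the log above; the variable names are just mine for illustration.

```python
from datetime import datetime

# Timestamps copied from the Curator log above (UTC).
FMT = "%Y-%m-%d %H:%M:%S,%f"
delete_started = datetime.strptime("2018-05-07 09:23:16,793", FMT)
action_failed = datetime.strptime("2018-05-07 09:24:16,546", FMT)

# Seconds between the delete request being logged and the failure.
elapsed = (action_failed - delete_started).total_seconds()
print(f"{elapsed:.3f} seconds")  # 59.753 seconds
```

That is just under 60 seconds, which lines up with the ELB timeout, although I have not confirmed that this is the actual cause.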
Any ideas how I can get the job to successfully delete the snapshots? Or how I can go about debugging what's happening? AWS support is not being very helpful.