Curator Elasticsearch 404 on delete_snapshots action in AWS Elasticsearch

I have a curator action file with the following contents:

---
# Remember, leave a key empty if there is no value.  None will be a string,
# not a Python "NoneType"
actions:
  1:
    action: snapshot
    description: Take a snapshot of logstash-* in the whole cluster
    options:
      repository: logstash-backups
      name: 'logstash-backups-%Y%m%d%H%M%S'
      wait_for_completion: True
      # 6 hours of seconds
      max_wait: 21600
      wait_interval: 10
      continue_if_exception: False
    filters:
      # backup everything and let elasticsearch manage deduping
      - filtertype: pattern
        kind: prefix
        value: logstash-
      # # backup only indexes older than 7 days
      # - filtertype: age
      #   source: creation_date
      #   direction: older
      #   unit: days
      #   unit_count: 14
  2:
    action: delete_indices
    description: Delete indices outside of the past 7 days
    options:
      continue_if_exception: False
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      # delete logstash indices created more than 7 days ago.
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 7
      # # delete logstash indices based on total size
      # - filtertype: space
      #   use_age: True
      #   source: creation_date
      #   disk_space: 2000
  3:
    action: delete_snapshots
    description: Delete snapshots greater than 60 days old
    options:
      repository: logstash-backups
      retry_interval: 120
      retry_count: 3
      continue_if_exception: False
      # FIXME Should be removed once we have 60 days of backups. i.e. 60
      # days from 2018-01-05
      ignore_empty_list: True
    filters:
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 60

The delete_snapshots step fails every day with an exception like the following (UTC times):

2018-05-07 09:23:15,250 INFO      Preparing Action ID: 3, "delete_snapshots"
2018-05-07 09:23:15,257 INFO      Trying Action ID: 3, "delete_snapshots": Delete snapshots greater than 60 days old
2018-05-07 09:23:15,988 INFO      Deleting selected snapshots
2018-05-07 09:23:16,793 INFO      Deleting snapshot logstash-backups-20180228080001...
2018-05-07 09:24:16,546 ERROR     Failed to complete action: delete_snapshots.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(404, 'snapshot_missing_exception', '[logstash-backups:logstash-backups-20180228080001] is missing')

Looking at my monitoring of the snapshot count, we do lose a snapshot right after that 404, which suggests a timeout of some kind. I know that the AWS ELB has a non-configurable 60-second timeout as well.

Any ideas how I can get the job to successfully delete the snapshots? Or how I can go about debugging what's happening? AWS support is not being very helpful.

Not much you can do if you're going through an ELB. Even if you increase timeout or set timeout_override, the ELB timeout is going to bite you.

Is there a way you can go directly to a client node and not go through an ELB?

Not in AWS. :frowning:

My current thought is to configure 5 or so delete_snapshots actions, each of which retries several times, combined with a timeout_override that is under the 60-second ELB timeout. It's a band-aid, but it might work.

The delete will continue in the background, even if it hasn't returned control to the client in the foreground yet, and even if the ELB times out. If you're going to employ an approach like that, do understand that this is what's going on in the background.

Can you change the ELB timeout value?

That is not a feature. It's really frustrating. Probably the most frustrating day-to-day bit of AWS ES I know of. :slight_smile:

It appears that if the delete is still running in the background, Curator reports that as:

2018-05-08 14:18:18,029 INFO      Snapshot activity detected in Tasks API

and I get into the retry logic, which is what I want.

In all cases, the delete does actually succeed, and fairly quickly after the timeout happens. Unfortunately, I can't seem to get the single delete_snapshots action to notice that and retry itself, hence the idea of having multiple delete_snapshots actions.

Unfortunately, the tasks API doesn't say which snapshot activity belongs to what, otherwise I'd rely on that for checking.
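
One way to check without relying on the Tasks API is to poll the snapshot itself until it no longer exists. Here is a minimal Python sketch of that idea. The function name and the two callables it takes (delete_fn, exists_fn) are made up for illustration and are not part of Curator; it only assumes that a timed-out delete request may still complete on the server side, as described above.

```python
import time


def delete_and_confirm(delete_fn, exists_fn, attempts=5, poll_interval=10.0,
                       sleep=time.sleep):
    """Request a snapshot delete, then poll until the snapshot is gone.

    delete_fn() issues the delete and may raise (for example on a proxy
    timeout). exists_fn() returns True while the snapshot is still listed.
    Both are hypothetical callables supplied by the caller.
    """
    try:
        delete_fn()
    except Exception as exc:
        # A timeout or a 404 here does not prove failure: the delete may
        # still be running, or may already have finished, on the server.
        print(f"delete request raised {exc!r}; polling instead")

    for _ in range(attempts):
        if not exists_fn():
            return True  # snapshot is gone, so the delete completed
        sleep(poll_interval)

    # One last check after the final wait.
    return not exists_fn()
```

With elasticsearch-py, delete_fn could wrap client.snapshot.delete(repository=..., snapshot=...) and exists_fn could call client.snapshot.get(...) and treat a not-found error as "gone". Keep the client's request timeout below the 60-second ELB limit so the failure surfaces on your side first.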

You might do better with curator_cli and a shell script that runs multiple delete_snapshots actions, to make sure it's all good and done.
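
A rough sketch of that wrapper, written in Python rather than shell so the retry loop is easier to test. It assumes curator_cli exits non-zero when the action fails, and that the flags used here (--timeout, --repository, --ignore_empty_list, --filter_list) match your installed version; please verify with curator_cli delete_snapshots --help. Connection settings (--config or --host) are left out and would need to be added. The function names are made up for illustration.

```python
import json
import subprocess
import time

# Same age filter as action 3 in the action file above.
FILTERS = [{"filtertype": "age", "source": "creation_date",
            "direction": "older", "unit": "days", "unit_count": 60}]


def build_cmd(repository, filters, timeout=50):
    """Build a curator_cli call; timeout stays under the 60 s ELB limit."""
    return ["curator_cli", "--timeout", str(timeout),
            "delete_snapshots", "--repository", repository,
            "--ignore_empty_list",
            "--filter_list", json.dumps(filters)]


def run_until_clean(cmd, attempts=5, pause=120,
                    run=subprocess.call, sleep=time.sleep):
    """Rerun cmd until it exits 0 or attempts run out."""
    for attempt in range(1, attempts + 1):
        if run(cmd) == 0:
            return True  # nothing left to delete, or the delete succeeded
        if attempt < attempts:
            sleep(pause)  # give the background delete time to finish
    return False
```

Calling run_until_clean(build_cmd("logstash-backups", FILTERS)) would then rerun the delete up to five times, two minutes apart. Since the delete keeps going in the background after a timeout, a later run should either find the snapshot already gone or pick up the next one.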

That's definitely something I could do. I'll consider it when I get around to it.
