Curator Elasticsearch 404 on delete_snapshots action in AWS Elasticsearch

timvisher · May 8, 2018, 1:48pm

I have a curator action file with the following contents:

---
# Remember, leave a key empty if there is no value.  None will be a string,
# not a Python "NoneType"
actions:
  1:
    action: snapshot
    description: Take a snapshot of logstash-* in the whole cluster
    options:
      repository: logstash-backups
      name: 'logstash-backups-%Y%m%d%H%M%S'
      wait_for_completion: True
      # 6 hours of seconds
      max_wait: 21600
      wait_interval: 10
      continue_if_exception: False
    filters:
      # backup everything and let elasticsearch manage deduping
      - filtertype: pattern
        kind: prefix
        value: logstash-
      # # backup only indexes older than 7 days
      # - filtertype: age
      #   source: creation_date
      #   direction: older
      #   unit: days
      #   unit_count: 14
  2:
    action: delete_indices
    description: Delete indices outside of the past 7 days
    options:
      continue_if_exception: False
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: logstash-
      # delete logstash indices created more than 7 days ago.
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 7
      # # delete logstash indices based on total size
      # - filtertype: space
      #   use_age: True
      #   source: creation_date
      #   disk_space: 2000
  3:
    action: delete_snapshots
    description: Delete snapshots greater than 60 days old
    options:
      repository: logstash-backups
      retry_interval: 120
      retry_count: 3
      continue_if_exception: False
      # FIXME Should be removed once we have 60 days of backups. i.e. 60
      # days from 2018-01-05
      ignore_empty_list: True
    filters:
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 60

The delete_snapshots step fails every day with an exception like the following (UTC times):

2018-05-07 09:23:15,250 INFO      Preparing Action ID: 3, "delete_snapshots"
2018-05-07 09:23:15,257 INFO      Trying Action ID: 3, "delete_snapshots": Delete snapshots greater than 60 days old
2018-05-07 09:23:15,988 INFO      Deleting selected snapshots
2018-05-07 09:23:16,793 INFO      Deleting snapshot logstash-backups-20180228080001...
2018-05-07 09:24:16,546 ERROR     Failed to complete action: delete_snapshots.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(404, 'snapshot_missing_exception', '[logstash-backups:logstash-backups-20180228080001] is missing')

Looking at my monitoring of the snapshot count, we do lose a snapshot right after that 404, and the error arrives almost exactly 60 seconds after the delete starts, which seems to indicate a timeout of some kind. I know that the AWS ELB has a non-configurable 60 second timeout as well.

Any ideas how I can get the job to successfully delete the snapshots? Or how I can go about debugging what's happening? AWS support is not being very helpful.

theuntergeek · May 8, 2018, 2:29pm

Not much you can do if you're going through an ELB. Even if you increase timeout or set timeout_override, the ELB timeout is going to bite you.

Is there a way you can go directly to a client node and not go through an ELB?

timvisher · May 8, 2018, 2:41pm

Not in AWS.

My current thought is to configure 5 or so delete_snapshots actions that each try several times to delete, combined with a timeout_override that is under the 60 second ELB timeout. It's a bandaid but it might work.
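Roughly, that could look like the sketch below. It is untested: the timeout_override value of 50, the action numbers, and the number of attempts are all assumptions, and continue_if_exception is flipped to True on the earlier attempt (my choice, not something from the original file) so that one failed attempt does not stop the run.

```yaml
actions:
  3:
    action: delete_snapshots
    description: Delete snapshots greater than 60 days old (attempt 1)
    options:
      repository: logstash-backups
      # Assumed value: keep the client timeout under the 60 second ELB timeout
      timeout_override: 50
      retry_interval: 120
      retry_count: 3
      # Assumption: let the next attempt run even if this one fails
      continue_if_exception: True
      ignore_empty_list: True
    filters:
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 60
  4:
    action: delete_snapshots
    description: Delete snapshots greater than 60 days old (final attempt)
    options:
      repository: logstash-backups
      timeout_override: 50
      retry_interval: 120
      retry_count: 3
      # Last attempt: surface a real failure instead of hiding it
      continue_if_exception: False
      # Needed so this attempt is a no-op when nothing is left to delete
      ignore_empty_list: True
    filters:
      - filtertype: age
        source: creation_date
        direction: older
        unit: days
        unit_count: 60
```

More copies of the first block could be added between the two if several snapshots age out on the same day.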

theuntergeek · May 8, 2018, 2:43pm

The delete will continue in the background, even if it hasn't returned control to the client in the foreground yet, and even if the ELB times out. If you're going to employ an approach like that, do understand that this is what's going on in the background.

Can you change the ELB timeout value?

timvisher · May 8, 2018, 2:48pm

That is not a feature. It's really frustrating. Probably the most frustrating day-to-day bit of AWS ES I know of.

It appears that if the delete is continuing in the background Curator notices that as a

2018-05-08 14:18:18,029 INFO      Snapshot activity detected in Tasks API

and I get into the retry logic, which is what I want.

In all cases, the delete does actually succeed, and fairly quickly after the timeout happens. Unfortunately, I can't seem to get the single delete_snapshots action to notice that and retry itself, hence the idea of having multiple delete_snapshots actions.

theuntergeek · May 8, 2018, 3:14pm

Unfortunately, the tasks API doesn't say which snapshot activity belongs to what, otherwise I'd rely on that for checking.

You might do better with curator_cli and a shell script that runs delete_snapshots multiple times, to make sure it's all good and done.

timvisher · May 8, 2018, 5:08pm

That's definitely something I could do. I'll consider it when I get around to it.

system · June 5, 2018, 5:09pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
