Curator timeout_override parameter not working for delete_snapshots action

Hello community,

I am using Curator to delete Elasticsearch snapshots. Over the past couple of days, snapshot deletion has started taking longer than before (more than 5 minutes), and since then my Curator job has been failing with the following error:

    2021-04-28 06:36:01,541 INFO      Deleting snapshot curator-2021-04-14-06:30:20...
    2021-04-28 06:41:01,695 ERROR     Failed to complete action: delete_snapshots.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: NotFoundError(404, 'snapshot_missing_exception', '[manual-triggered:curator-2021-04-14-06:30:20] is missing')

Then I found the timeout_override option in the Curator Reference [5.8] and added it to my config, but Curator still appears to use 300s as the timeout.

    ---
    actions:
      1:
        action: delete_snapshots
        description: >-
          "Delete selected snapshots from 'manual-triggered' older than 14 days matching curator-"
        options:
          repository: manual-triggered
          retry_interval: 120
          retry_count: 3
          ignore_empty_list: True
          timeout_override: 21600
        filters:
        - filtertype: pattern
          kind: prefix
          value: curator-
        - filtertype: age
          source: creation_date
          direction: older
          unit: days
          unit_count: 14

curator version - 5.8.1
elasticsearch version - 7.6.2

Thanks.

I'm sorry you've had a bad experience. I can tell you that a precise 5 minute timeout/disconnect irrespective of any timeout_override setting is indicative of a proxy or other similar service between Curator and the Elasticsearch instance it is trying to connect to. There are multiple issues in the Curator GitHub repository where this has been reported, from Amazon ELB instances to regular proxies. Given that you're connecting to Kubernetes, I can imagine that this may have something to do with it.
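
As a quick check of the 5-minute pattern, the two log timestamps from the original post can be subtracted. This is only a sketch: the helper name `elapsed_seconds` is made up for illustration, and it assumes the timestamp layout seen in the log lines quoted above.

```python
from datetime import datetime

# Timestamp layout as it appears in the Curator log lines quoted in the post.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two log timestamps (hypothetical helper)."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    ended = datetime.strptime(end, LOG_TIME_FORMAT)
    return (ended - started).total_seconds()


# "Deleting snapshot..." line versus the "Failed to complete action" line.
elapsed = elapsed_seconds("2021-04-28 06:36:01,541", "2021-04-28 06:41:01,695")
print(f"{elapsed:.3f} s")  # 300.154 s
```

A gap of almost exactly 300 seconds, even with timeout_override set to 21600, is what suggests that something other than Curator is closing the connection.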

Unfortunately, there is nothing I can do to address this. The problem cannot be addressed from within Curator, as it happens outside of Curator. The okay-ish thing about this is that the initial API call to delete the snapshot will continue after the proxy disconnect. I have responded to the issue you raised in GitHub with more information and links to others experiencing the same problem.
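
Since the delete keeps running on the cluster after the disconnect, one possible workaround is to poll until the snapshot is gone instead of relying on the original connection. The following is a rough sketch, not something Curator provides: `wait_until_deleted` and the `snapshot_exists` callable are hypothetical names, and in practice `snapshot_exists` would have to be a short request of your own that asks the cluster whether the snapshot is still listed in the repository.

```python
import time
from typing import Callable


def wait_until_deleted(
    snapshot_exists: Callable[[], bool],
    poll_interval: float = 30.0,
    max_wait: float = 21600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the snapshot is gone; give up after max_wait seconds of sleeping.

    snapshot_exists is a caller-supplied check (an assumption of this sketch).
    Each check should be a short request, so it stays well under the proxy's
    300-second limit.
    """
    waited = 0.0
    while snapshot_exists():
        if waited >= max_wait:
            return False  # still present after the allowed waiting time
        sleep(poll_interval)
        waited += poll_interval
    return True  # the snapshot is no longer reported


# Demo with a stand-in check: the snapshot is "still there" for two polls.
answers = iter([True, True, False])
print(wait_until_deleted(lambda: next(answers), sleep=lambda _: None))  # True
```

The sleep function is passed in as a parameter only so the demo can run instantly; with the defaults it checks every 30 seconds for up to 6 hours, matching the 21600 from the config in the question.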

On a different note, I see you're running version 5.8.1 of Curator. I don't remember off-hand what has been changed since 5.8.1, but 5.8.4 was released last night, and I recommend upgrading to it regardless (it won't fix proxy timeouts). There are many improvements and security patches.

@theuntergeek Thanks a lot for your swift response.

We are using AWS managed Elasticsearch, and you are absolutely correct: there is a 300s timeout on the LB in front of the Elasticsearch service. So it's clear that this is a proxy timeout issue, not a Curator issue.

Thanks,
Shiv