Curator delete_snapshots always times out

Background:

Elasticsearch 6.2.1 + Curator 5.5.1
Azure Storage is used for storing snapshots

I'm using Curator to delete the snapshots, but it keeps failing with a timeout error.

Below is the log:

2018-09-06 20:42:09,389 INFO Preparing Action ID: 1, "delete_snapshots"
2018-09-06 20:42:09,525 INFO Trying Action ID: 1, "delete_snapshots": Delete continuous snapshot that are old and not relevant
2018-09-06 20:42:13,558 INFO Deleting selected snapshots
2018-09-06 20:42:14,851 INFO Deleting snapshot snapshot.2018.08.28.07.00...
2018-09-06 20:48:21,553 INFO Deleting snapshot snapshot.2018.08.28.13.00...
2018-09-06 21:33:21,602 ERROR Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='......', port=9200): Read timed out. (read timeout=2700))

I tried with timeout_override: 2700 (45 minutes), but it still fails. I have a lot of snapshots to delete, selected by the creation timeline encoded in the snapshot name, and I configured the Curator input accordingly. Each snapshot represents ~1000 shards of 5-7 TB.

Only about 25% of the shards are around 40 GB; the rest are single-digit GB. The data is separated into different indices/shards due to other constraints and is in the process of being rationalized. The entire cluster is snapshotted periodically (as reflected by the timestamps in the snapshot names in the log) based on certain requirements.

I removed "elasticsearch" and "urllib3" from the blacklist in the Curator config and set the log level to DEBUG to see whether any additional logging appears around the timeout, but nothing beyond what is shown above is available. There are plenty of other DEBUG/INFO messages, though, such as which snapshots are available and how each filter applies to each snapshot when deciding what to remove.

I tried deleting directly via the Elasticsearch API with DELETE _snapshot/logrepo/snapshot.2018.08.29.13.00 and got a gateway timeout in ~8 seconds.
I then issued _cluster/pending_tasks and saw:
{
"tasks": []
}
So I tried issuing another DELETE with a different snapshot, but got a concurrent_snapshot_execution_exception.

Any help on why Curator keeps timing out would be appreciated. It is not really Curator; rather, Elasticsearch times out on the deletion, but there is no information available about the status of the action.

The timeout being hard-fixed at 2700 seconds, regardless of what you set timeout or timeout_override to, implies a proxy, load balancer, or gateway of some kind. The gateway timeout in ~8 seconds you mentioned seems to confirm that. There isn't much you can do to overcome that.

As you can see, however, Elasticsearch does try to continue the snapshot delete in the background. As you can also see, it doesn't create a task that is visible in the pending_tasks API endpoint.

I am afraid that this is not within my capacity to address (and probably not within yours either), since the same or similar thing happens with a manual API call to delete the snapshot.

Thanks Aaron.

But I can issue a curl command for the DELETE snapshot action with a max timeout (-m) setting, for example:
curl -k -XDELETE "https://..:9200/_snapshot/logrepo/snapshot.2018.08.29.13.00" -m 600
The result is as below
Before Curl Thu Sep 6 22:51:41 UTC 2018
{"acknowledged":true}
After Curl Thu Sep 6 22:59:06 UTC 2018

The gateway timeout error occurs when I issue the DELETE command from Kibana Dev Tools. The error appeared in ~8 seconds in the result window:
{
"statusCode": 504,
"error": "Gateway Time-out",
"message": "Client request timeout"
}

I have many snapshots to remove. Curator doesn't consistently error out after removing the first item from the identified delete list; sometimes 1 snapshot is removed, sometimes 2, before the timeout error comes.

If I can pass -m to curl and have the DELETE executed for each snapshot, shouldn't --timeout-override do the same with Curator when it calls the Elasticsearch API?

Sorry, just trying to understand the cause better. The 2700 is fixed in the Curator action settings for delete_snapshots using timeout_override: 2700. So my guess would be that Curator issues each HTTP delete-snapshot request, for every snapshot in the identified list, with a max timeout of 2700.

We don't have any gateway between Elasticsearch and Curator.

Could you please clarify?

Sure.

This is because curl is:

  • only deleting a single snapshot
  • not waiting for that snapshot deletion to complete

Curator can have multiple snapshots selected for deletion at once, which requires it to wait for the first to finish before proceeding to another. If you had immediately attempted to delete another snapshot after that curl call, you would almost certainly have encountered another concurrent_snapshot_execution_exception, because it was happening in the background. You should be able to see some of this with a GET _tasks, rather than GET _cluster/pending_tasks. It will show something indicating snapshot deletion is in progress.
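For illustration, here is a rough sketch (not part of Curator; the host URL and auth settings are placeholders) of checking the _tasks API for snapshot-related activity with the elasticsearch-py client:

# Sketch only: list currently running tasks and print any snapshot-related actions.
from elasticsearch import Elasticsearch

client = Elasticsearch(['https://your-es-host:9200'], verify_certs=False)  # placeholder host

# _tasks returns tasks grouped by node; filter for snapshot actions.
response = client.tasks.list(detailed=True, actions='*snapshot*')
for node_id, node in response.get('nodes', {}).items():
    for task_id, task in node.get('tasks', {}).items():
        print(task_id, task.get('action'), task.get('running_time_in_nanos'))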

Because individual snapshots can vary in the count and size of segments involved, this explains why Curator is sometimes able to delete 2 before the client timeout, and sometimes not.

If you set timeout_override higher than 2700, does it still time out at 2700? Or does it time out at whatever you've set? I would suggest setting it to something considerably higher, like 21600. The default is 300, so 2700 is not hard coded into Curator.
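For reference, a minimal sketch (not Curator's actual client setup; host, credentials, and snapshot names below are placeholders) of how a per-request timeout on the elasticsearch-py client corresponds to the read timeout=2700 seen in your error, and to curl's -m:

# Sketch only: the client-side timeout is what raises ConnectionTimeout/ReadTimeoutError
# when Elasticsearch has not responded within that many seconds.
from elasticsearch import Elasticsearch

client = Elasticsearch(
    ['https://your-es-host:9200'],  # placeholder host
    timeout=21600,                  # roughly what timeout_override ends up as, per the suggestion above
    verify_certs=False,
)

# Roughly the equivalent of: curl -XDELETE .../_snapshot/logrepo/<snapshot> -m 21600
client.snapshot.delete(repository='logrepo', snapshot='snapshot.2018.08.29.13.00')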

Thanks Aaron.

Agreed that Curator has a list of snapshots to delete; that is visible in the debug log as well. If my understanding of Curator is correct, it performs something similar to the below in sequence for every snapshot it has identified for deletion, perhaps using a loop or some other construct.

"Test.sh" - Exact shell script to delete snapshots in sequence
echo date
curl -k -XDELETE "https://...:9200/_snapshot/logrepo/snapshot.2018.08.30.13.00" -u <> -m 600
echo date
curl -k -XDELETE "https://...:9200/_snapshot/logrepo/snapshot.2018.08.30.19.00" -u <> -m 600
echo date

The script output was as follows:
Thu Sep 6 23:32:49 UTC 2018
{"acknowledged":true}
Thu Sep 6 23:39:14 UTC 2018
{"acknowledged":true}
Thu Sep 6 23:48:04 UTC 2018

In the shell case as well, the first delete finishes in ~7 minutes with an acknowledged:true response. Then the second snapshot delete request is accepted, meaning no snapshot operation is in progress at that point. If I issue a DELETE snapshot between 32:49 and 39:14, I do get a concurrent_snapshot_execution_exception, which is as expected. But after 39:14, the next curl DELETE is accepted.

Shouldn't Curator do the same? Presumably it sends the equivalent of -m 2700 based on the override. Kindly clarify.

Yes, the _tasks API shows the snapshot delete action running against the elected master node.

That's all that Curator does:

        try:
            for s in self.snapshot_list.snapshots:
                self.loggit.info('Deleting snapshot {0}...'.format(s))
                self.client.snapshot.delete(
                    repository=self.repository, snapshot=s)

It immediately attempts to delete the next snapshot as soon as the last one returns control to the client. There is no other code going on in this block.

Thanks Aaron.

Somehow I'm suspicious that the timeout is always set to 300 irrespective of timeout_override being set to 2700.

I'm not very familiar with Python, but looking at the Curator code I see the snippets below.

In cli.py

def run(config, action_file, dry_run=False):
    .....
    if isinstance(timeout_override, int):
        client_args['timeout'] = timeout_override
    else:
        client_args['timeout'] = default_timeout

    # Set up action kwargs
    kwargs = {}
    kwargs['master_timeout'] = (
        client_args['timeout'] if client_args['timeout'] <= 300 else 300)
    kwargs['dry_run'] = dry_run

    # Create a client object for each action...
    client = get_client(**client_args)

In utils.py:

def get_client(**kwargs):
Most of the code there deals with AWS (I believe for the Elasticsearch SaaS offering), and I don't see any timeout attribute handling there. Then I see client = elasticsearch.Elasticsearch(**kwargs). I haven't looked at the elasticsearch module code to check how the kwargs are used there to instantiate the client.

Also, I do see some other functions for getting snapshot actions from tasks. Should there be an option to use the task status instead of a timeout?

If you are using an action.yml file, it uses timeout_override. See here, here, and here. There is no avoiding this override if it's set; you would never be able to get a 2700-second timeout otherwise. Unit testing confirms that the value is appropriately passed to the get_client function.

Unfortunately, so far Elasticsearch does not create a task id for a snapshot delete action. Without that, there's no accurate way of tracking a snapshot delete. Instead, it's more of a guess. You can see that there is snapshot delete activity in the _tasks API call, but it's not actually associated with a specific task. The problem is that without a task id, I have no idea whether there is only 1 task, or multiple tasks spawned to delete the snapshot with its associated segments. Do the tasks stay active for a long time? A short time? Are there invisible sub-tasks? It's very hard to chase this down. I do not know how it would behave in a heavily loaded cluster, a lightly loaded cluster, a repository which has a very large number of snapshots & segments vs. the opposite. It's a frustrating endeavor. Perhaps it is still possible to tease out a solution.

I personally do not have the time or energy or a pressing need to code a way to monitor the _tasks API output for this sort of guess work. If someone else were to submit a pull request to add it to the existing utils.wait_for_it function collection, I'd be happy to review the code and merge it.
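For anyone interested in attempting that, here is a very rough sketch of what such a poller might look like (a hypothetical helper, not part of Curator's utils; the function name, polling interval, and limits are illustrative):

import time
from elasticsearch import Elasticsearch

def wait_for_snapshot_tasks(client, interval=30, max_wait=21600):
    """Poll _tasks until no snapshot-related actions remain, or max_wait elapses.

    As described above, this is guesswork: there is no task id to track, so we
    only watch for the presence of snapshot-related actions across all nodes.
    """
    elapsed = 0
    while elapsed < max_wait:
        response = client.tasks.list(detailed=True, actions='*snapshot*')
        still_running = any(
            node.get('tasks') for node in response.get('nodes', {}).values()
        )
        if not still_running:
            return True
        time.sleep(interval)
        elapsed += interval
    return False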

Thanks Aaron.

Yes, the code snippet references do seem to pick up the timeout_override. But the assignment client_args['timeout'] if client_args['timeout'] <= 300 else 300 at line 156 looked like it overrides the value, which is what confused me. However, get_client is passed only client_args, and kwargs is used when instantiating the action: action_obj = action_class(slo, **mykwargs).

I set it to 2700 (a deliberately high value) because I noticed that each snapshot deletion takes approximately 5-6 minutes, so 2700 should cover many of the snapshots to be deleted (2700/6). But I am not sure why the loop stops in between. I had around 30-40 snapshots to delete.
