Curator - DeleteSnapshots - Not fully functional

Environment: Curator 5.5.1, Elasticsearch 6.2.1

Trying to use the delete_snapshots action with Curator, but it always fails with the error below:

2018-06-05 23:26:03,003 DEBUG curator.utils iterate_filters:528 Post-instance: [list of snapshots.........]
2018-06-05 23:26:03,003 DEBUG curator.cli process_action:99 Doing the action here.
2018-06-05 23:26:03,003 INFO curator.actions.delete_snapshots do_action:991 Deleting selected snapshots
2018-06-05 23:26:03,587 INFO curator.actions.delete_snapshots do_action:1000 Deleting snapshot <<Snapshot 1 from the list>>
2018-06-05 23:41:03,590 ERROR curator.cli run:184 Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='-------', port=9200): Read timed out. (read timeout=900))

timeout_override is 900 against the default 300.

The action has a couple of filters that select snapshots based on a regex and on age. Looking at the Curator debug log, it does end up with the expected list of snapshots after applying the filters. It then loops through the list and issues a delete for the first item, after which it hangs and never moves on to the rest. I thought it was a timeout issue and increased the timeout to 900, but that made no difference. Once the timeout is exceeded it errors out, yet I do see that the first snapshot in the list was deleted successfully.
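
For reference, the action file is shaped roughly like this; the repository name, regex, and age values below are placeholders rather than the real ones:

    actions:
      1:
        action: delete_snapshots
        description: Delete snapshots matching a pattern and older than N days
        options:
          repository: my_azure_repo       # placeholder
          timeout_override: 900
          retry_interval: 120
          retry_count: 3
        filters:
        - filtertype: pattern
          kind: regex
          value: 'curator-.*'             # placeholder pattern
        - filtertype: age
          source: creation_date
          direction: older
          unit: days
          unit_count: 30                  # placeholder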

While Curator was performing the action, I monitored ES using the _tasks API. It did list the snapshot delete task for the first item, which completed well within ~5 minutes (300 seconds). Curator should then have started on the next snapshot, but it never did, and the _tasks API didn't list any further snapshot delete task until the 900-second timer ran out.

Any thoughts/guidance is appreciated. I looked at the Curator code in cli.py, but the log message comes from the catch-all handler (logger.error).

Curator is at 5.5.4 now, and there have definitely been some improvements since then.

As for what's going on in that wait period, more logging is necessary. Set blacklist: [] in the client config under the logging section, and set loglevel: DEBUG. The default blacklist hides the elasticsearch and urllib3 responses; those settings will unhide them, as well as add debug-level logging of what's happening inside Curator during that time period.
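
For reference, the logging section of the client configuration file (curator.yml) would then look something like this, with the rest of the file left as you have it:

    logging:
      loglevel: DEBUG
      logfile:
      logformat: default
      blacklist: []      # the default is ['elasticsearch', 'urllib3']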

Hi Aaron, thanks for the comment.

Yup! I should have removed the entries from the blacklist to get more info.

Below is the log I see once the empty blacklist is applied.

2018-06-06 18:32:39,854 INFO curator.actions.delete_snapshots do_action:1000 Deleting snapshot <>...
2018-06-06 18:32:39,854 DEBUG urllib3.util.retry from_int:200 Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
2018-06-06 18:47:39,939 WARNING elasticsearch log_request_fail:97 DELETE https://-----:9200/_snapshot/<>/<> [status:N/A request:900.085s]
Traceback (most recent call last):
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
File "", line 2, in raise_from
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 383, in _make_request
File "/usr/local/lib/python3.6/http/client.py", line 1331, in getresponse
File "/usr/local/lib/python3.6/http/client.py", line 297, in begin
File "/usr/local/lib/python3.6/http/client.py", line 258, in _read_status
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
File "/usr/local/lib/python3.6/ssl.py", line 1009, in recv_into
File "/usr/local/lib/python3.6/ssl.py", line 871, in read
File "/usr/local/lib/python3.6/ssl.py", line 631, in read
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/buh/.local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 115, in perform_request
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 639, in urlopen
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 333, in increment
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/packages/six.py", line 686, in reraise
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 601, in urlopen
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 389, in _make_request
File "/home/buh/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 309, in _raise_timeout
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='-------', port=9200): Read timed out. (read timeout=900)

Observations

  1. The delete for the snapshot is issued, but the code then appears to call the ES API against the snapshot that has just been removed, which would not return any valid value.
  2. Looking at the log, the code also calls the _tasks API even before executing the snapshot delete. I'm not sure why, since the snapshot delete task would not be part of that result (as observed in the log).
  3. I also observed that .../_snapshot/<repo>/_all is called multiple times (at least twice), before filtering and after filtering. Not sure why; if we have a lot of snapshots, it ends up listing them all each time.

Regardless, the core issue could be observation 1, as the loop never picks up the next snapshot.

Any guidance would be helpful.

Huh? The code in question is here:

        try:
            for s in self.snapshot_list.snapshots:
                self.loggit.info('Deleting snapshot {0}...'.format(s))
                self.client.snapshot.delete(
                    repository=self.repository, snapshot=s)
        except Exception as e:
            utils.report_failure(e)

Before it does this, it checks once to see if it is safe to delete a snapshot. It does not check this for each snapshot.

The flow is as follows, by log line (using the 5.5.1 tag, since that's what you're using and what your log lines reflect):

2018-06-05 23:26:03,003 INFO curator.actions.delete_snapshots do_action:991 Deleting selected snapshots

Line 991 is:

    self.loggit.info('Deleting selected snapshots')

The next step is line 992, which checks whether it is safe to take a snapshot (or, in this case, delete one) by verifying that there are no snapshots currently in progress. This part does check the _tasks API. Technically this is unnecessary, as the standard exception would result anyway, but I test manually first so I can properly log it. That's all this step does.
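
As a rough sketch (not the exact Curator code, and with a placeholder host), that pre-flight check amounts to something like this:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['https://example-host:9200'])  # placeholder host

    def snapshot_activity(client):
        # GET /_snapshot/_status lists only snapshots that are currently running
        in_progress = bool(client.snapshot.status()['snapshots'])
        # GET /_tasks?detailed=true -- look for any running task that mentions a snapshot
        tasks = client.tasks.list(detailed=True)
        snapshot_tasks = any(
            'snapshot' in task.get('description', '')
            for node in tasks.get('nodes', {}).values()
            for task in node.get('tasks', {}).values()
        )
        return in_progress or snapshot_tasks

    # Curator proceeds to the delete loop only if nothing is in flight.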

Having established that it is safe to delete a snapshot, Curator then loops through each snapshot and deletes them one at a time. This correlates with the log line:

2018-06-05 23:26:03,587 INFO curator.actions.delete_snapshots do_action:1000 Deleting snapshot 

This redundancy will hopefully be addressed in a future release of Curator. It's mostly checking to see if something has changed from initial execution, i.e. new snapshots.

I respectfully disagree with that conclusion. The loop will not pick up the next snapshot until

self.client.snapshot.delete(repository=self.repository, snapshot=s)

...has returned. The problem is that it is not returning, and I can't explain why. It's a very simple call, and it normally returns almost immediately. Curator has been around for many years, and is production tested by many users, beyond the CI testing that it goes through for each build iteration. As I have not seen the behavior you're describing reported by any other users, I cannot accept the idea that the code is the issue here. There is nothing that checks and waits for a snapshot to delete to finish except for that call returning. As this works for everyone else, I have to suspect something in your environment isn't letting it complete.

My first suspect is something repository related. What kind of repository are you using? Have you looked at the Elasticsearch logs to see what the interaction between Elasticsearch and the shared filesystem is? This is the next place I would look to troubleshoot.

Hi Aaron,

Thanks for the response.

We are using Azure storage for the snapshots. I have yet to look at the ES logs, as the snapshot process itself completes successfully; only the delete snapshot action fails.

" It's a very simple call, and it normally returns almost immediately" -- Yes, but I'm concerned on what property or condition would impact this? Is there any chance that the snapshot metadata size returned by _all on snapshot API impact this? If this is the case, the code did call the API twice and was successful.

When I delete a given (single) snapshot with the Elasticsearch delete snapshot API from Dev Tools, it fails with a timeout from the UI perspective, but the DELETE keeps running in the background and cleans up the snapshot in 5-6 minutes. I believe the same thing is happening here. Should we use the _tasks API to monitor the progress of the snapshot delete task, and then loop to the next one? Also, please refer to the WARNING at 2018-06-06 18:47:39,939 from log_request_fail:97. Its timestamp is the full 900-second timeout after the actual delete snapshot call at 2018-06-06 18:32:39,854 for the same snapshot. The GET should be returning 404, snapshot not found, and be logged as a WARNING, but the traceback comes immediately after that instead.
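
To illustrate what I mean by monitoring, something along these lines (purely hypothetical on my part; the task action name is my assumption, and this is not what Curator currently does):

    import time

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['https://example-host:9200'])  # placeholder host

    def wait_for_snapshot_delete(client, poll_interval=30):
        # Poll the tasks API until no snapshot-delete task is left running,
        # after which the caller could move on to the next snapshot.
        while True:
            tasks = client.tasks.list(
                actions='cluster:admin/snapshot/delete',  # assumed action name
                detailed=True)
            running = any(node.get('tasks')
                          for node in tasks.get('nodes', {}).values())
            if not running:
                return
            time.sleep(poll_interval)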

None. It's exclusively upstream in this case. I wish I had good news for you, but Azure has been a bit suspect in its behavior as a snapshot repository. Users have frequently had to employ the skip_repo_fs_check option because the Azure filesystem often did not respond quickly enough for that check to pass.

No. No chance at all, because those calls are quite complete before a single delete is ever attempted. And they are not called again during the delete loop.

I agree 100%. The huge problem for me is the "it fails on timeout from the UI perspective" part. If it fails in Console, what makes you think it will be successful anywhere else? This is a huge red flag. This call should never time out in Console, or take longer than 30 seconds. In rare cases I expect up to 3 minutes (the new timeout_override for delete_snapshots is 180 seconds, if I recall correctly). However, these are outliers. That your cluster + repository filesystem cannot delete a snapshot within 900 seconds is an indicator that there is a problem somewhere.

This solution would only address this use case, which is, as previously mentioned, exceptionally rare. You're the only user ever to report this behavior to me. I think the issue is between Elasticsearch and your Azure repository, rather than within Curator. The evidence supporting this conclusion is that Console is timing out for you in exactly the same way.

Out of curiosity, what is your cloud.azure.storage.timeout set to?

Hi Aaron,

We haven't set it explicitly and left it at the default, which should be 5 minutes according to the documentation.

Also, I'm trying to understand the impact of skip_repo_fs_check with respect to the delete_snapshots action, but I'm not able to find any reference to it in the Elasticsearch API documentation, and the plugin doesn't seem to provide an option for it. Maybe I'm missing something. Is it a check from Curator? Please clarify.

The name skip_repo_fs_check is what Curator uses for the repository verification API call to Elasticsearch. It is not used for snapshot deletions; I mentioned it because you're using Azure, which has been known to be slower to respond as a snapshot repository.

Curator wants to ensure that all nodes are properly connected to the shared file system before performing actions (creating a snapshot, specifically), so that there are no unexpected surprises. It uses this API call to do so. Ordinarily the test takes only a moment, but on some network file shares (notably S3 and Azure) it can take just long enough that the verify call comes back with a negative result, even though snapshots still work.
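
In client terms, that verification boils down to roughly the following call (repository name and host below are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['https://example-host:9200'])  # placeholder host

    # POST /_snapshot/<repo>/_verify -- the call that skip_repo_fs_check bypasses
    result = es.snapshot.verify_repository(repository='my_azure_repo')
    print(result['nodes'])  # the nodes that successfully verified access to the repository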

Got it. But the issue here is with delete_snapshots, so we don't need to set it, right? Please confirm. I don't see any issues with snapshot creation, though.

If we leave skip_repo_fs_check aside for snapshot deletion, as it is not applicable, the only remaining option would be to look at the Elasticsearch logs.

Is there anything else I should check? I monitored the Azure Storage metrics, and while the delete operation is running I do see the transactions, which should represent the clean-up of files/file references. They correlate with the delete of the first snapshot in the list, and the task lives for the same duration as the activity I see in the Azure metrics.

If Dev Tools is getting the timeout error below, the same is to be anticipated here. Agreed. Could it be due to some setting on the network, perhaps at the network-routing level? Because ES keeps running in the background and ensures that the snapshot is removed.
{
"statusCode": 504,
"error": "Gateway Time-out",
"message": "Client request timeout"
}

Is the timeout between the client (Curator/Dev Tools) and ES, or between the storage and ES? If it is the second case (storage to ES), the deletion itself should fail, so I may need to check the routing between ES and its clients.

That is correct.

Is that in Kibana? Or is that in the Elasticsearch logs?

I recommend an appeal directly to the Elasticsearch logs. Kibana is a sort of proxy/backend, so I'm not sure how much to read into the 504 here. The direct feed from Elasticsearch will prove more helpful, I believe. The client->Elasticsearch request is timing out because (and this is my belief) there is too much lag between Elasticsearch and Azure.

It is from Kibana Dev Tools.

My understanding is that the given timeout setting (900 seconds) is for client to Elasticsearch. Moreover, ES was successfully deleting the snapshot after about 5 minutes, which is well within the 900-second timeout. Also, if I leave the timeout at the default, ES still completes its task.

I will look into the ES logs. Hopefully I can monitor only the node running the snapshot delete task; we have 30 nodes, so I don't want to watch logs on all of them :slight_smile:
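
A rough sketch of how I hope to narrow that down (the action filter here is a guess on my part):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['https://example-host:9200'])  # placeholder host

    # List running tasks whose action mentions 'snapshot' and show which node
    # each one runs on, so only that node's logs need watching.
    tasks = es.tasks.list(actions='*snapshot*', detailed=True)
    for node_id, node in tasks.get('nodes', {}).items():
        for task in node.get('tasks', {}).values():
            print(node_id, node.get('name'), task.get('action'), task.get('description'))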

Your starting place should be the node that Curator and/or Kibana connect to.
