Background:
Elasticsearch version 6.2.1 + Curator 5.5.1
Azure Storage is used for storing snapshots
I'm using Curator to delete the snapshots, but it keeps failing with a timeout error.
Below is the log:
2018-09-06 20:42:09,389 INFO Preparing Action ID: 1, "delete_snapshots"
2018-09-06 20:42:09,525 INFO Trying Action ID: 1, "delete_snapshots": Delete continuous snapshot that are old and not relevant
2018-09-06 20:42:13,558 INFO Deleting selected snapshots
2018-09-06 20:42:14,851 INFO Deleting snapshot snapshot.2018.08.28.07.00...
2018-09-06 20:48:21,553 INFO Deleting snapshot snapshot.2018.08.28.13.00...
2018-09-06 21:33:21,602 ERROR Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='......', port=9200): Read timed out. (read timeout=2700))
I tried timeout_override: 2700 (45 minutes), but it still fails. I have a lot of snapshots to delete, selected by the creation time encoded in the snapshot name, and I configured the Curator input accordingly. Each snapshot represents ~1000 shards of 5-7 TB.
Only 25% of the shards are around 40 GB; the rest are in single-digit GB. The data is split into separate indices/shards because of other constraints, and we are in the process of rationalizing this. Snapshots of the entire cluster are taken periodically (as shown by the timestamps in the snapshot names in the log), based on certain requirements.
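To make the timing explicit, here is a small sketch that only does arithmetic on the timestamps copied from the log above. It assumes the first deletion finished when the second "Deleting snapshot" line was logged; the variable names are mine.

```python
from datetime import datetime

# Timestamps copied verbatim from the Curator log above.
FMT = "%Y-%m-%d %H:%M:%S,%f"
first_delete_start = datetime.strptime("2018-09-06 20:42:14,851", FMT)
second_delete_start = datetime.strptime("2018-09-06 20:48:21,553", FMT)
failure_logged = datetime.strptime("2018-09-06 21:33:21,602", FMT)

# Assumption: the first delete ended when the second one was announced.
first_delete_seconds = (second_delete_start - first_delete_start).total_seconds()

# Time between starting the second delete and the ERROR line.
second_wait_seconds = (failure_logged - second_delete_start).total_seconds()

timeout_override = 2700  # seconds, as set in the Curator action

print(f"first delete took about {first_delete_seconds:.0f} s")
print(f"second delete waited {second_wait_seconds:.0f} s before the error")
```

So the first snapshot appears to have been deleted in roughly 6 minutes, while the second one ran for the full 2700 seconds and was cut off by the client read timeout rather than by an error from Elasticsearch.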
I removed "elasticsearch" and "urllib3" from the blacklist in the Curator config and set the log level to DEBUG to see whether any additional information is logged about the timeout, but nothing shows up beyond what is above. Many other DEBUG and INFO messages are shown, such as which snapshots are available and how each filter applies to each snapshot to decide what to remove.
I tried deleting directly through the Elasticsearch API with DELETE _snapshot/logrepo/snapshot.2018.08.29.13.00 and got a Gateway Timeout after ~8 seconds.
I then issued _cluster/pending_tasks and got the following:
{
"tasks": []
}
So I tried issuing another DELETE for a different snapshot, but got "concurrent_snapshot_execution_exception".
Any help on why Curator keeps timing out would be appreciated. It seems that it is not Curator but Elasticsearch that times out on the deletion, and no information is available on the status of the action.
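The exception suggests the first delete may still be running on the cluster even though the HTTP request timed out, and _cluster/pending_tasks only lists queued cluster state updates. A minimal sketch of what I could try next, assuming the task management API (GET _tasks) reports the in-flight delete under an action name containing "snapshot" (for example cluster:admin/snapshot/delete). The function names and the base_url placeholder are hypothetical, and authentication/TLS settings are left out.

```python
import json
from urllib.request import urlopen


def snapshot_tasks(tasks_response):
    """Return (task_id, action, running_seconds) for snapshot-related tasks.

    Expects the decoded JSON body of GET _tasks?detailed=true, which groups
    tasks per node under "nodes" -> <node id> -> "tasks".
    """
    found = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            # Assumption: snapshot operations have "snapshot" in the action name.
            if "snapshot" in task.get("action", ""):
                seconds = task.get("running_time_in_nanos", 0) / 1e9
                found.append((task_id, task["action"], seconds))
    return found


def fetch_tasks(base_url):
    """GET <base_url>/_tasks?detailed=true and decode the JSON body.

    base_url is a placeholder for the cluster address (scheme, host, port).
    """
    with urlopen(base_url + "/_tasks?detailed=true") as response:
        return json.load(response)
```

Calling snapshot_tasks(fetch_tasks(base_url)) every minute or so should show whether a delete is still in progress and for how long it has been running, which is the status information I could not get from Curator or from pending_tasks.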