Curator snapshots with TransportError

Hello, I've installed curator 3.5.1 on an ES 1.4.2 cluster. I'm running daily snapshots of all indices, and from time to time I get 503 errors saying there's another snapshot already running (when the only task is the one I launched with curator)...

After setting up debugging I see this:
2016-07-19 15:53:54,402 DEBUG urllib3.util.retry from_int:170 Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2016-07-19 15:54:53,571 DEBUG urllib3.connectionpool _make_request:401 "PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true HTTP/1.1" 504 0
2016-07-19 15:54:53,571 WARNING elasticsearch log_request_fail:82 PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true [status:504 request:59.169s]
2016-07-19 15:54:53,572 DEBUG elasticsearch log_request_fail:90 > {"indices": "geo_locations_offers_production,geo_shapes,offers_v3_production,ranking_medians_production", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2016-07-19 15:54:53,572 DEBUG elasticsearch log_request_fail:93 <
2016-07-19 15:54:53,572 DEBUG urllib3.util.retry from_int:170 Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2016-07-19 15:55:05,336 DEBUG urllib3.connectionpool _make_request:401 "PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true HTTP/1.1" 503 243
2016-07-19 15:55:05,337 WARNING elasticsearch log_request_fail:82 PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true [status:503 request:11.765s]
2016-07-19 15:55:05,337 DEBUG elasticsearch log_request_fail:90 > {"indices": "geo_locations_offers_production,geo_shapes,offers_v3_production,ranking_medians_production", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2016-07-19 15:55:05,337 DEBUG elasticsearch log_request_fail:93 < {"error":"RemoteTransportException[[ip-ha-ha-ha-es-01][inet[/ha.ha.ha.ha:9300]][cluster:admin/snapshot/create]]; nested: ConcurrentSnapshotExecutionException[[backups:curator-20160719135353] a snapshot is already running]; ","status":503}
2016-07-19 15:55:05,337 DEBUG urllib3.util.retry from_int:170 Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2016-07-19 15:55:05,375 DEBUG urllib3.connectionpool _make_request:401 "PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true HTTP/1.1" 503 243
2016-07-19 15:55:05,375 WARNING elasticsearch log_request_fail:82 PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true [status:503 request:0.038s]
2016-07-19 15:55:05,375 DEBUG elasticsearch log_request_fail:90 > {"indices": "geo_locations_offers_production,geo_shapes,offers_v3_production,ranking_medians_production", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2016-07-19 15:55:05,375 DEBUG elasticsearch log_request_fail:93 < {"error":"RemoteTransportException[[ip-ha-ha-ha-es-01][inet[/ha.ha.ha.ha:9300]][cluster:admin/snapshot/create]]; nested: ConcurrentSnapshotExecutionException[[backups:curator-20160719135353] a snapshot is already running]; ","status":503}
2016-07-19 15:55:05,376 DEBUG urllib3.util.retry from_int:170 Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0)
2016-07-19 15:55:05,413 DEBUG urllib3.connectionpool _make_request:401 "PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true HTTP/1.1" 503 243
2016-07-19 15:55:05,413 WARNING elasticsearch log_request_fail:82 PUT /_snapshot/backups/curator-20160719135353?wait_for_completion=true [status:503 request:0.038s]
2016-07-19 15:55:05,414 DEBUG elasticsearch log_request_fail:90 > {"indices": "geo_locations_offers_production,geo_shapes,offers_v3_production,ranking_medians_production", "partial": false, "ignore_unavailable": false, "include_global_state": true}
2016-07-19 15:55:05,414 DEBUG elasticsearch log_request_fail:93 < {"error":"RemoteTransportException[[ip-ha-ha-ha-es-01][inet[/ha.ha.ha.ha:9300]][cluster:admin/snapshot/create]]; nested: ConcurrentSnapshotExecutionException[[backups:curator-20160719135353] a snapshot is already running]; ","status":503}
2016-07-19 15:55:05,414 ERROR curator.api.snapshot create_snapshot:101 Client raised a TransportError.
2016-07-19 15:55:05,414 WARNING curator.cli.utils exit_msg:69 Job did not complete successfully.

The ConcurrentSnapshotExecutionException ("a snapshot is already running") is the essence of this error message.

In general this error means that you need to wait until the currently running snapshot finishes before starting a new one; Elasticsearch can only perform one snapshot at a time. It's possible that your snapshot is running for too long because of a large amount of data and takes more than a day. It's also possible that you are hitting one of the snapshot bugs that causes snapshots to hang in some situations. This would typically coincide with general cluster instability (like master nodes leaving the cluster due to overload or network issues). ES 1.4.2 has several known issues that could manifest themselves this way. Most of them were fixed in 1.6-1.7, so I would suggest upgrading to at least 1.7.5 if you cannot move to the 2.x branch just yet.
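
Before launching a new snapshot you can check whether one is still in flight. A minimal check with curl, assuming the repository is called backups (as in the logs above) and Elasticsearch is reachable on localhost:9200:

# Show the snapshot currently in progress in the "backups" repository, if any
curl -s 'http://localhost:9200/_snapshot/backups/_current?pretty'
# An empty "snapshots" list means nothing is running and a new snapshot can be started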

Thanks, Igor, for your answer! Sadly we cannot upgrade our current version for now (we have some dependencies that would break if we did).
What I find weird is that even after getting the error saying another snapshot is already running, this is what I see when I query the snapshot status:

{
  "snapshot": "curator-20160719135353",
  "indices": [
    "geo_locations_offers_production",
    "geo_shapes",
    "offers_v3_production",
    "ranking_medians_production"
  ],
  "state": "SUCCESS",
  "start_time": "2016-07-19T13:53:54.626Z",
  "start_time_in_millis": 1468936434626,
  "end_time": "2016-07-19T13:58:25.972Z",
  "end_time_in_millis": 1468936705972,
  "duration_in_millis": 271346,
  "failures": [],
  "shards": {
    "total": 40,
    "failed": 0,
    "successful": 40
  }
}

So, although curator is failing, the snapshots are actually being created.
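
Snapshot details like the output above, including the start and end timestamps, come from the snapshot info API; a sketch with curl, where the host is an assumption:

# Info for the snapshot named in the error
curl -s 'http://localhost:9200/_snapshot/backups/curator-20160719135353?pretty'
# Or list every snapshot in the repository to compare timestamps
curl -s 'http://localhost:9200/_snapshot/backups/_all?pretty'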

Is it possible that you have another instance of curator running somewhere that tries to create or delete a snapshot at the same time? You can check the timestamps on the snapshots (start_time and end_time in the output above) and compare them to the timestamps on the error messages to see if you can identify what is trying to start multiple snapshots simultaneously.

Hey, hello!
It's the only instance running, and it's the only ES cluster on the network; I also checked that there are no running snapshots by querying /_snapshot/backups/_current.

I will try to upgrade to 1.7 and see if the issue is still there.

Thanks!

I don't want to discourage you from upgrading (1.7 is a huge improvement with several serious issues fixed), but in that case I don't think it will help with this particular issue.

On the positive side, I just spoke to the creator of curator and he pointed out that your issue looks very similar to this one. If you have a proxy or a load balancer between the machine that runs curator and Elasticsearch, it's possible that the proxy cuts off the long-running snapshot call, which causes curator to retry the call a couple of times. The retries fail with the "a snapshot is already running" exception because the original snapshot is still running. It doesn't affect the original snapshot, though, which finishes normally. That would explain the log messages you are getting.
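
One way to sidestep such a proxy timeout is to avoid holding the HTTP connection open for the whole snapshot: start it without wait_for_completion and poll afterwards. A rough sketch with curl, using the repository and index names from the logs above (the host and snapshot name are placeholders):

# Start the snapshot; without wait_for_completion the PUT returns almost immediately,
# so no intermediary has to keep a long-lived connection open
curl -s -XPUT 'http://localhost:9200/_snapshot/backups/curator-20160719135353' -d '{
  "indices": "geo_locations_offers_production,geo_shapes,offers_v3_production,ranking_medians_production",
  "ignore_unavailable": false,
  "include_global_state": true
}'
# Then poll until "state" reports SUCCESS
curl -s 'http://localhost:9200/_snapshot/backups/curator-20160719135353?pretty'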

Hey Igor, I'm so sorry for not answering before... That was exactly the issue: our ES cluster is running behind an ELB, and the snapshot was taking way longer than the default connection timeout. After increasing it (to a much higher number) it's working perfectly fine.
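
For anyone landing here with the same setup, this is roughly how a Classic ELB's idle timeout can be raised with the AWS CLI (the load balancer name and the timeout value are placeholders, not taken from this thread):

# Raise the idle timeout so long-running snapshot calls aren't cut off by the ELB
aws elb modify-load-balancer-attributes \
  --load-balancer-name my-es-elb \
  --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":1800}}"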

Thanks for all your help! I really appreciate it