Supporting retries on Close operation

I was going to open a feature request/question on Github but was guided here instead.

I was interested in Curator supporting retry_count & retry_interval for the close operation, similar to how they exist for delete_snapshots.

One of our Curator configs runs daily at 2am and takes care of a number of tasks against our daily indices. Among those tasks is a close operation followed by an index_settings operation that changes the index codec to best_compression. The indices are then reopened and a forcemerge is performed.
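For context, the relevant slice of our action file looks roughly like this (a sketch only; the filter patterns, values, and descriptions are illustrative, not our exact config):

```yaml
# Illustrative Curator action file; filters abbreviated and shared via a YAML anchor.
actions:
  1:
    action: close
    description: Close yesterday's daily indices before the codec change
    filters: &daily_filters
      - filtertype: pattern
        kind: regex
        value: '.*-jaeger-(span|service)-.*'
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y-%m-%d'
        unit: days
        unit_count: 1
  2:
    action: index_settings
    description: Switch the closed indices to best_compression
    options:
      index_settings:
        index:
          codec: best_compression
    filters: *daily_filters
  3:
    action: open
    description: Reopen the indices so the new codec applies on merge
    filters: *daily_filters
  4:
    action: forcemerge
    description: Rewrite segments with the new codec
    options:
      max_num_segments: 1
    filters: *daily_filters
```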

Occasionally this action fails if one of the indices it is trying to close (and therefore sync-flush) still has ongoing indexing operations. In an ideal world all indexing on these indices would halt at midnight (and largely it does), but occasionally a small amount of lagged indexing is still happening (e.g. a backlog of events from a log shipper).

Example error:

2021-02-08 02:00:45,873 INFO      Closing selected indices: [u'dev-jaeger-service-2021-02-07', u'bcp-jaeger-span-2021-02-07', u'preprod-jaeger-service-2021-02-07', u'prod-jaeger-span-2021-02-07', u'prod-jaeger-service-2021-02-07', u'bcp-jaeger-service-2021-02-07', u'preprod-jaeger-span-2021-02-07', u'dev-jaeger-span-2021-02-07']
2021-02-08 02:01:23,562 ERROR     Failed to complete action: close.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConflictError(409, u'{"_shards":{"total":240,"successful":239,"failed":1},"bcp-jaeger-span-2021-02-07":{"total":30,"successful":30,"failed":0},"preprod-jaeger-span-2021-02-07":{"total":30,"successful":30,"failed":0},"preprod-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"bcp-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"prod-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"dev-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"dev-jaeger-span-2021-02-07":{"total":30,"successful":30,"failed":0},"prod-jaeger-span-2021-02-07":{"total":30,"successful":29,"failed":1,"failures":[{"shard":9,"reason":"pending operations","routing":{"state":"STARTED","primary":false,"node":"A2iAem-CSmWblYbRUmrjOA","relocating_node":null,"shard":9,"index":"prod-jaeger-span-2021-02-07","allocation_id":{"id":"ri47qAoYT8aSFqC6K_1Wug"}}}]}}')

In most cases a subsequent retry of the job, even only a minute or so later, will succeed, which is why I feel supporting retries would be beneficial.

Other workarounds/options I've thought about:

  • Run the job at a later time -- This would probably reduce the chance of the close operation colliding with lagged writes, but it would push back all the following actions (like forcemerging, which can take a while) even though in most cases that wouldn't be necessary
  • Perform the close operation after the allocation operations that move data to warm nodes -- After more reading I realised wait_for_completion on the allocation stage doesn't actually wait for all shards to reallocate. My initial thought was that this would buy extra time before the close needed to happen, but in practice it's only a matter of seconds later
  • Set ignore_sync_failures to true -- This might be the correct option, however my understanding is that if I ignore sync failures, the affected shards miss out on the fast sync-id recovery path, so reopening the index requires a full recovery, which could take significant time on bigger indices (our biggest indices are ~1.7TB daily)
  • Set skip_flush to true -- Same as above

Perhaps I'm missing a better way to handle this that I've not thought about?


I can see why this would be nice in your case, but if you have to retry a close operation, it generally means your cluster is overloaded, overworked, or that it's unable to keep the cluster state in sync within a short window of seconds (or all of the above). It means your request to close an index is timing out, or that the cluster state update is timing out. That's bad. Like, cluster is very unhealthy bad. Sure, I could add this retry functionality, but it won't address the underlying problem, which is that your cluster is not happy in some way.

EDIT: Looking closer at the actual error, consider waiting longer before attempting to close the index. You're seeing sync_flush failures. If it works after only a few more minutes, then it means there were ongoing writes to the index when you attempted to close it (hence the sync_flush failures). If you want it to proceed anyway, at the risk of a slower re-opening of the index after the settings change, you can always use the ignore_sync_failures option for the close action.

Thanks @theuntergeek

If it works after only a few more minutes, then it means there were ongoing writes to the index when you attempted to close it

This is exactly it. Whilst things should have stopped writing to the index by this point, occasionally there's still a trickle of documents coming through. We run Curator via a Kubernetes CronJob, and we often see a retry of the job (running seconds after the failure) succeed; however, sometimes it can be a few minutes before indexing is done, by which time the job has usually exhausted its retry limit. Pushing the job back is certainly an option, but since it succeeds 9 times out of 10 it's nice to keep the earlier time, which allows forcemerges etc. to happen out of hours.

I think ultimately what I'm after is an exponential retry with larger starting intervals (e.g. 1 minute, 5 minutes, 20 minutes, etc.) -- Perhaps I just need to wrap Curator in a script that does this itself
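A minimal sketch of such a wrapper, assuming Curator is invoked as a CLI and that a non-zero exit code means the run failed (the delay values are illustrative):

```python
import subprocess
import time


def run_with_backoff(cmd, delays=(60, 300, 1200)):
    """Run a command, retrying on failure with widening intervals.

    `delays` are the waits in seconds before each retry (here 1, 5,
    then 20 minutes). Raises RuntimeError if every attempt fails.
    """
    for delay in (0, *delays):
        if delay:
            time.sleep(delay)
        # Curator exits non-zero when an action (e.g. close) fails
        if subprocess.run(cmd).returncode == 0:
            return
    raise RuntimeError(f"{cmd[0]} failed after {len(delays) + 1} attempts")
```

Called as e.g. `run_with_backoff(["curator", "--config", "config.yml", "actions.yml"])`, this would retry the whole action file; splitting the close into its own action file would avoid re-running actions that already succeeded.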