I was going to open a feature request/question on GitHub but was guided here instead.
I'm interested in Curator supporting retry_count and retry_interval for the close operation, similar to how they already exist for delete_snapshots.
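For reference, this is how those options are exposed on delete_snapshots today (option names are from the Curator docs; the repository name and filter values are just illustrative):

```yaml
actions:
  1:
    action: delete_snapshots
    description: Delete snapshots older than 14 days, retrying if a snapshot is in progress
    options:
      repository: my_backup_repo   # illustrative repository name
      retry_count: 3               # number of retries before giving up
      retry_interval: 120          # seconds to wait between retries
    filters:
    - filtertype: age
      source: creation_date
      direction: older
      unit: days
      unit_count: 14
```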
One of our curator configs runs daily at 2am and takes care of a number of tasks against daily indices. Included in those tasks is a close operation followed by an index_settings operation to change the codec of the index to best_compression. The indices are then reopened and a forcemerge is performed.
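A simplified sketch of that part of the action file (the index pattern, filter details and forcemerge settings here are illustrative rather than our exact config):

```yaml
actions:
  1:
    action: close
    description: Close yesterday's daily indices
    options:
      ignore_sync_failures: False
      skip_flush: False
    filters: &daily_filters
    - filtertype: pattern
      kind: regex
      value: '.*-jaeger-(span|service)-.*'
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y-%m-%d'
      unit: days
      unit_count: 1
  2:
    action: index_settings
    description: Switch the codec to best_compression while the indices are closed
    options:
      index_settings:
        index:
          codec: best_compression
    filters: *daily_filters
  3:
    action: open
    description: Reopen the indices so they can be forcemerged
    options:
      ignore_empty_list: True
    filters: *daily_filters
  4:
    action: forcemerge
    description: Merge each shard down to a single segment
    options:
      max_num_segments: 1
    filters: *daily_filters
```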
Occasionally this action fails if one of the indices it is trying to close (and therefore sync flush) has any ongoing indexing operations. In an ideal world all indexing operations on these indices should halt at midnight (and largely they do), but occasionally a small number of lagged indexing operations are still happening (e.g. a backlog of events from a log shipper).
Example error:
2021-02-08 02:00:45,873 INFO Closing selected indices: [u'dev-jaeger-service-2021-02-07', u'bcp-jaeger-span-2021-02-07', u'preprod-jaeger-service-2021-02-07', u'prod-jaeger-span-2021-02-07', u'prod-jaeger-service-2021-02-07', u'bcp-jaeger-service-2021-02-07', u'preprod-jaeger-span-2021-02-07', u'dev-jaeger-span-2021-02-07']
2021-02-08 02:01:23,562 ERROR Failed to complete action: close. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConflictError(409, u'{"_shards":{"total":240,"successful":239,"failed":1},"bcp-jaeger-span-2021-02-07":{"total":30,"successful":30,"failed":0},"preprod-jaeger-span-2021-02-07":{"total":30,"successful":30,"failed":0},"preprod-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"bcp-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"prod-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"dev-jaeger-service-2021-02-07":{"total":30,"successful":30,"failed":0},"dev-jaeger-span-2021-02-07":{"total":30,"successful":30,"failed":0},"prod-jaeger-span-2021-02-07":{"total":30,"successful":29,"failed":1,"failures":[{"shard":9,"reason":"pending operations","routing":{"state":"STARTED","primary":false,"node":"A2iAem-CSmWblYbRUmrjOA","relocating_node":null,"shard":9,"index":"prod-jaeger-span-2021-02-07","allocation_id":{"id":"ri47qAoYT8aSFqC6K_1Wug"}}}]}}')
In most cases a subsequent retry of the job, even only a minute or so later, will succeed, which is why I feel supporting retries would be beneficial.
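Concretely, what I'm imagining is something like the below. To be clear, retry_count and retry_interval are not options the close action supports today; that's exactly what I'm asking about:

```yaml
actions:
  1:
    action: close
    description: Close yesterday's indices, retrying if shards still have pending operations
    options:
      ignore_sync_failures: False
      skip_flush: False
      retry_count: 3        # hypothetical: not a supported close option today
      retry_interval: 60    # hypothetical: not a supported close option today
    filters:
    - filtertype: pattern
      kind: regex
      value: '.*-jaeger-(span|service)-.*'   # illustrative pattern
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y-%m-%d'
      unit: days
      unit_count: 1
```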
Other workarounds/options I've thought about:
- Run the job at a later time, which would reduce the chance of the close operation happening while indexing is still ongoing -- This would probably reduce the chance of hitting this, but it would push back all following actions (like forcemerging, which can take a while) when in a lot of cases it wouldn't be necessary
- Perform the close operation after the allocation operations that move data to warm nodes -- After more reading I realised wait_for_completion on the allocation stage doesn't actually wait for all shards to have reallocated. My initial thought was this would buy extra time until the operation needed to be done, but in actuality it's only a matter of seconds later
- Set ignore_sync_failures to true -- This might be the correct option, however my understanding is that if I ignore any sync failures, when the index is reopened it will have to be rebuilt, which could take significant time on bigger indices (our biggest indices are ~1.7TB daily)
- Set skip_flush to true -- Same as above (see the sketch after this list for how these allocation/close options would be set)
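For completeness, here's a rough sketch of how the allocation and close options from the last three points would look in an action file (the node attribute, values and index pattern are illustrative, and I've set both close flags just to show where they live):

```yaml
actions:
  1:
    action: allocation
    description: Move yesterday's indices to warm nodes before closing them
    options:
      key: box_type               # illustrative node attribute
      value: warm
      allocation_type: require
      wait_for_completion: True   # in practice this returned within seconds, not after all shards had moved
      max_wait: 3600
      wait_interval: 30
    filters: &daily_filters
    - filtertype: pattern
      kind: regex
      value: '.*-jaeger-(span|service)-.*'
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y-%m-%d'
      unit: days
      unit_count: 1
  2:
    action: close
    description: Close the indices, tolerating in-flight operations
    options:
      ignore_sync_failures: True  # ignore synced-flush failures instead of aborting
      skip_flush: True            # skip the synced flush entirely
    filters: *daily_filters
```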
Perhaps I'm missing a better way to handle this that I've not thought about?
Cheers!