Best practices for monitoring curator snapshot (failures)

jonassteinberg1 · November 27, 2018, 4:55pm

@theuntergeek
I have a nice curator --> AWS S3 snapshot pipeline for backing up my clusters, but from time to time a snapshot fails, so I'm wondering what the best way to monitor snapshot failures is.

Right now I'm sending all of my curator events to my cluster via rsyslog, and I was planning on using watchers to email based on certain conditions (like failures). Rsyslog works great and I've no doubt the watchers will technically work, but I don't know what the best thing to match on would be. I've thought about watching for the literal exception strings thrown by the various snapshot failure types, and I think that will work to a certain extent. My concern is that if I update curator or ES and those exception messages change, my watchers will break. If you have any good ideas about what to 'watch' for, or a totally different approach not involving watchers or log parsing, I'd be very interested to know your thoughts.

Thanks,

Jonas Steinberg

jonassteinberg1 · November 27, 2018, 5:19pm

Small tangent, but related:

My curator jobs run on cron. I had thought of just sending an email every time curator's exit code is non-zero, but even when I break the jobs intentionally curator still exits 0. Thoughts?

theuntergeek · November 27, 2018, 5:27pm

Curator does perform a check for SUCCESS state, which runs immediately after it has detected that the snapshot has completed. This check is only performed, however, if wait_for_completion is set to true.

If an exception is not causing a non-zero exit code, do you have continue_if_exception set to true? Otherwise, this sounds like a bug.
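The exit-code approach discussed above could be sketched as a small wrapper that cron invokes instead of calling curator directly. This is only an illustration, not part of Curator: `run_and_report` and the `notify` callback are hypothetical names, and it assumes the wrapped command really does exit non-zero on failure (Python 3.7+ for `capture_output`).

```python
import subprocess
import sys


def run_and_report(command, notify):
    """Run `command` (a list of arguments) and call `notify(message)`
    if it exits with a non-zero code. Returns the exit code.

    `notify` is a hypothetical callback: it could send an email, post
    to a chat webhook, or simply print so that cron's MAILTO picks it up.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        notify(
            "command {!r} exited with code {}\n{}".format(
                command, result.returncode, result.stderr.strip()
            )
        )
    return result.returncode


# Demonstration with a stand-in command that always fails; a real cron
# job would pass the curator command line here instead.
alerts = []
exit_code = run_and_report(
    [sys.executable, "-c", "import sys; sys.exit(3)"], alerts.append
)
```

Keeping the notification behind a callback means the same wrapper can be pointed at email, syslog, or anything else without changing how the exit code is checked.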

jonassteinberg1 · November 27, 2018, 5:59pm

@theuntergeek

Yes, I have wait_for_completion set to true. So regarding an expression or literal for a watcher to look for: what would you recommend? It would be nice if the watcher checked for the SUCCESS state only after the snapshot either completes or breaks. I'm not asking you to write the watcher, merely for an expression to use or something like that.
continue_if_exception is set to its default of false. I've looked into this situation more and it seems that curator is behaving reasonably.

theuntergeek · November 27, 2018, 6:27pm

While a snapshot is running, its state is IN_PROGRESS. Curator will continue to check this until a different state is reported, such as SUCCESS, PARTIAL, or FAILED (anything other than IN_PROGRESS). The final state is then checked again, and any non-SUCCESS state results in an exception being raised.
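The polling behaviour described above can be illustrated with a simplified sketch. This is not Curator's actual code: `wait_for_snapshot`, `SnapshotFailed`, and the `get_state` callable are hypothetical names, and the separate re-check of the final state is folded into a single comparison.

```python
import time


class SnapshotFailed(Exception):
    """Raised when a snapshot ends in any state other than SUCCESS."""


def wait_for_snapshot(get_state, interval=0.0, sleep=time.sleep):
    """Poll `get_state()` until it stops returning IN_PROGRESS.

    `get_state` is a hypothetical zero-argument callable that returns
    the snapshot's current state as a string. Any final state other
    than SUCCESS (PARTIAL, FAILED, ...) raises SnapshotFailed.
    """
    state = get_state()
    while state == "IN_PROGRESS":
        sleep(interval)
        state = get_state()
    if state != "SUCCESS":
        raise SnapshotFailed("snapshot finished in state {}".format(state))
    return state


# Demonstration with canned states instead of a live cluster.
canned = iter(["IN_PROGRESS", "IN_PROGRESS", "SUCCESS"])
final_state = wait_for_snapshot(lambda: next(canned))
```

Because every non-SUCCESS outcome funnels into one exception type, anything watching the job only needs to distinguish "raised" from "returned" rather than match specific error strings.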

If you want it to be reported on, you could make API calls with watches to check for a snapshot in progress, and to see if any of the previous snapshots finished with a state other than SUCCESS.
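As a rough sketch of that check, the function below classifies snapshots from an already-parsed API response. It assumes the response has the shape `{"snapshots": [{"snapshot": <name>, "state": <state>}, ...]}`, as returned by a get-snapshots request against a single repository; that shape should be verified against the Elasticsearch version in use. The function name and the sample data are hypothetical.

```python
def summarize_snapshots(response):
    """Split a parsed get-snapshots response into two lists:
    names of snapshots still IN_PROGRESS, and (name, state) pairs
    for snapshots that finished in any state other than SUCCESS.
    """
    in_progress = []
    unsuccessful = []
    for snap in response.get("snapshots", []):
        name = snap.get("snapshot")
        state = snap.get("state")
        if state == "IN_PROGRESS":
            in_progress.append(name)
        elif state != "SUCCESS":
            unsuccessful.append((name, state))
    return in_progress, unsuccessful


# Made-up sample data for illustration only.
sample = {
    "snapshots": [
        {"snapshot": "snap-1", "state": "SUCCESS"},
        {"snapshot": "snap-2", "state": "PARTIAL"},
        {"snapshot": "snap-3", "state": "IN_PROGRESS"},
    ]
}
running, failed = summarize_snapshots(sample)
```

Keying on the `state` field rather than on log text avoids the concern about exception messages changing between Curator or Elasticsearch releases.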

A future release of Curator will log each action's results to an index, which will be easier to use with watches. But that's not ready for consumption yet.

jonassteinberg1 · November 27, 2018, 6:40pm

Will do!

