Best practices for monitoring curator snapshot (failures)

@theuntergeek
I have a nice Curator --> AWS S3 snapshot pipeline for backing up my clusters, but from time to time a snapshot fails. What is the best way to monitor for snapshot failures?

Right now I'm sending all of my Curator events to my cluster via rsyslog, and I was planning to use watchers to email on certain conditions (like failures). Rsyslog works great and I have no doubt the watchers will technically work, but I don't know what the best thing to match on would be. I've thought about watching for the literal exception strings thrown for the various snapshot failure types, and I think that will work to a certain extent. My concern is that if I update Curator or ES and those exception messages change, my watchers will break. If you have any good ideas about what to watch for, or a totally different approach that doesn't involve watchers or log parsing, I'd be very interested to hear your thoughts.

Thanks,

Jonas Steinberg

Small tangent, but related:

My Curator jobs run on cron. I had thought of just sending an email every time Curator's exit code was non-zero, but even when I break the jobs intentionally Curator still exits 0. Thoughts?

Curator does perform a check for SUCCESS state, which runs immediately after it has detected that the snapshot has completed. This check is only performed, however, if wait_for_completion is set to true.

If an exception is not causing a non-zero exit code, do you have continue_if_exception set to true? Otherwise, this sounds like a bug.
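For the cron side, here is a minimal sketch of a wrapper that runs a command and reports when the exit code is non-zero. The function name run_and_report and the notify callback are made up for illustration; they are not part of Curator. It only helps once Curator actually exits non-zero on failure.

```python
import subprocess
import sys


def run_and_report(command, notify):
    """Run `command` (a list of arguments) and call notify(message)
    if it exits with a non-zero status.

    Both this function and the `notify` callback are hypothetical helpers.
    The exit code is returned so the caller can pass it on to cron.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Keep only the last few stderr lines so the alert stays short.
        tail = result.stderr.strip().splitlines()[-5:]
        notify(
            "command {!r} exited {}:\n{}".format(
                command, result.returncode, "\n".join(tail)
            )
        )
    return result.returncode
```

From a cron script this could be called as run_and_report(["curator", "--config", "config.yml", "snapshot.yml"], print), where both file names are placeholders and print stands in for whatever sends the email.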

@theuntergeek

  1. Yes, I have wait_for_completion set to true. So, regarding an expression or literal to watch for: what would you recommend? It would be nice if the watcher checked for the SUCCESS state only after the snapshot either completes or breaks. I'm not asking you to write the watcher, merely for an expression to use or something like that.

  2. continue_if_exception is set to its default of false. I've looked into this situation more and it seems that curator is working reasonably.

While a snapshot is running, its state is IN_PROGRESS. Curator keeps checking until a different state is reported: SUCCESS, PARTIAL, FAILED, or anything else that is not IN_PROGRESS. The final state is then checked again, and any non-SUCCESS state results in an exception being raised.
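The polling behaviour described above can be sketched roughly like this. It is an illustration of the logic only, not Curator's actual source; the names wait_for_snapshot, SnapshotFailed and get_state, and the 9 second default interval, are assumptions made for the example.

```python
import time


class SnapshotFailed(Exception):
    """Raised when a snapshot ends in any state other than SUCCESS."""


def wait_for_snapshot(get_state, poll_seconds=9.0, sleep=time.sleep):
    """Poll get_state() until it stops returning IN_PROGRESS.

    `get_state` is a caller-supplied function returning the snapshot's
    current state string. Returns "SUCCESS" or raises SnapshotFailed.
    """
    state = get_state()
    while state == "IN_PROGRESS":
        sleep(poll_seconds)
        state = get_state()
    # Anything other than SUCCESS (PARTIAL, FAILED, ...) counts as a failure.
    if state != "SUCCESS":
        raise SnapshotFailed("snapshot finished with state {}".format(state))
    return state
```

An uncaught exception like this is what should turn into a non-zero exit code when the job is run from cron.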

If you want it to be reported on, you could make API calls with watches to check for a snapshot in progress, and to see if any of the previous snapshots finished with a state other than SUCCESS.
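As a rough idea of what such a check could look like outside of a watch, here is a small sketch that reads the snapshot list from the repository and splits it into snapshots still in progress and snapshots that ended in a non-SUCCESS state. It assumes an unauthenticated cluster reachable over plain HTTP; the helper names fetch_snapshots and summarize, and the base URL and repository name passed in, are placeholders.

```python
import json
import urllib.request


def fetch_snapshots(base_url, repository):
    """Return the list of snapshots from GET <base_url>/_snapshot/<repository>/_all.

    Assumes no authentication or TLS client setup is needed.
    """
    url = "{}/_snapshot/{}/_all".format(base_url.rstrip("/"), repository)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response).get("snapshots", [])


def summarize(snapshots):
    """Split snapshots into (in_progress_names, not_successful_pairs).

    not_successful_pairs holds (name, state) for every snapshot whose
    state is neither SUCCESS nor IN_PROGRESS, e.g. PARTIAL or FAILED.
    """
    in_progress = [
        s["snapshot"] for s in snapshots if s.get("state") == "IN_PROGRESS"
    ]
    not_successful = [
        (s["snapshot"], s.get("state"))
        for s in snapshots
        if s.get("state") not in ("SUCCESS", "IN_PROGRESS")
    ]
    return in_progress, not_successful
```

Checking the state field this way avoids depending on the wording of exception messages, which was the original concern about log parsing.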

A future release of Curator will log each action's results to an index, which will be easier to use with watches. But that's not ready for consumption yet.

Will do!
