Watching watcher


(Joel Baranick) #1

It would be nice to be able to configure watches to alert when their queries fail to return within a configured timeout. The default timeout could be the schedule interval. This would protect against long-running queries preventing watches from firing.

(Steve Kearns) #2

This is a good question - I would love to learn more about your goal here. I imagine a few reasons:

  • Get notified when any configured watches exceed their configured timeouts
  • Ensure consistent watch execution times for short-interval watches

Are there other goals you had in mind?

Today, you can specify a timeout in the search input request body, and we do record the search execution information (e.g., timed_out) in the watch history. These fields aren't indexed yet, but it's something we could consider adding, so you could create a watch that looks at the watch history for timed_out ES queries.
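As a rough sketch of the first part (a timeout in the search input request body), the Python snippet below builds a watch body and prints it as JSON. The index pattern `logs-*`, the `10s` timeout, the query, the action name `log_errors`, and the helper `build_watch` are all placeholders made up for illustration, not anything from your setup.

```python
import json


def build_watch(search_timeout="10s", interval="1m"):
    """Return a watch body whose search input carries a request-level timeout.

    All concrete values here (index pattern, query, action name) are
    hypothetical placeholders.
    """
    return {
        "trigger": {"schedule": {"interval": interval}},
        "input": {
            "search": {
                "request": {
                    "indices": ["logs-*"],  # hypothetical index pattern
                    "body": {
                        # Search-level timeout in the request body.
                        "timeout": search_timeout,
                        "query": {"match": {"message": "error"}},
                    },
                }
            }
        },
        # Fire only when the search returned at least one hit.
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": 0}}},
        "actions": {
            "log_errors": {
                "logging": {
                    "text": "Found {{ctx.payload.hits.total}} matching documents"
                }
            }
        },
    }


watch = build_watch()
print(json.dumps(watch, indent=2))
```

Keep in mind that the search timeout is best-effort: a search that hits it can still return partial results with timed_out set to true, so the condition above may be evaluated against incomplete hits.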

(Joel Baranick) #3

If we are relying on watches to alert us when there is a production issue, then the watches are a critical piece of the infrastructure. As such, if watches stop running (or start timing out), it is a production live-site issue that needs to be addressed immediately. This means we need to be alerted about watches that time out, fail, or fail to run. Ideally, this notification would be resistant to Elasticsearch cluster issues (red, lots of GC-ing, etc.).

(Steve Kearns) #4


Coming back around to this. I now see what you're after, and it makes a lot of sense.

Many of our customers use Marvel for monitoring Elasticsearch - it records metrics and telemetry from Elasticsearch over time. For larger clusters, we recommend storing the Marvel data in a separate monitoring cluster. In much the same way, you can run Watcher on a monitoring cluster and simply query your production cluster using the HTTP input.
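A minimal sketch of that setup, assuming the watch lives on the monitoring cluster and polls the production cluster's `/_cluster/health` endpoint through the HTTP input. The host name `prod-es.example.com`, the action name `log_red`, and the helper `build_remote_health_watch` are hypothetical; swap in your own values.

```python
import json


def build_remote_health_watch(host, port=9200, interval="1m"):
    """Return a watch body that polls a remote cluster's health over HTTP.

    The watch is meant to be registered on the monitoring cluster; `host`
    and `port` point at the production cluster (hypothetical values).
    """
    return {
        "trigger": {"schedule": {"interval": interval}},
        "input": {
            "http": {
                "request": {
                    "host": host,
                    "port": port,
                    "path": "/_cluster/health",
                }
            }
        },
        # The cluster health response has a top-level "status" field.
        "condition": {"compare": {"ctx.payload.status": {"eq": "red"}}},
        "actions": {
            "log_red": {
                "logging": {
                    "text": "Production cluster status is {{ctx.payload.status}}"
                }
            }
        },
    }


remote_watch = build_remote_health_watch("prod-es.example.com")
print(json.dumps(remote_watch, indent=2))
```

This only fires when the production cluster answers with a red status. If the production cluster cannot be reached at all, the input itself fails, so you would probably want to handle that case separately.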

We expected that monitoring Elasticsearch itself would be a common use case, so we have provided a few examples of watches based on Marvel data.

Hope that helps!
