Alert when okay

Does Watcher have a mechanism to alert when an alarm is resolved? I want an alert that signifies when an alarm has transitioned from error to okay.

Hi @acchaulk,

Watcher can absolutely do this because it is just a matter of defining a condition to trigger on.

If the condition passes, such as when an "error" state is detected, then you can follow up by performing any action that is supported by Watcher.

Therefore, the issue you may be having is how to define the right condition, and there are several different strategies that you can employ, depending on the complexity that you are willing to endure:

  • Create two Watches
    1. The first Watch triggers on detecting the "Okay -> Error" transition.
    2. The second Watch triggers on detecting the "Error -> Okay" transition.
    • This is generally the simplest approach (a minimal sketch of the second Watch follows this list).
  • Create a Watch with a chain input
    • Detect both scenarios in the same Watch using the separate inputs.
    • Use a script condition and most likely a script transform to compare the separate responses and determine if this is worth alerting against.
    • Report the transition in the action(s).
  • Create a Watch with a chain input, but store the state somewhere (or read it from .watcher-history-*)
    • This is effectively the same thing as the second one, but it allows lapses in running the actual Watch because you can remember the previous state rather than hoping to catch it in your current request.
      • This complicates the overall Watch, but it simplifies its behavior.
    • If you are not going to read the previous state from .watcher-history-*, then you need to create your own "state" index for remembering the last run(s).
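
To make the first option concrete, here is a minimal sketch of the second Watch (the "Error -> Okay" half), as promised above. The index pattern (my-status-*), the timestamp and status fields, and the okay/error values are placeholders for whatever your data actually looks like; the Watch simply compares the two most recent status documents and only fires on the transition back to okay:

PUT _xpack/watcher/watch/error_to_okay
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          "my-status-*"
        ],
        "body": {
          "size": 2,
          "_source": [
            "status"
          ],
          "sort": [
            {
              "timestamp": {
                "order": "desc"
              }
            }
          ]
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.hits.total >= 2 && ctx.payload.hits.hits[0]._source.status == 'okay' && ctx.payload.hits.hits[1]._source.status == 'error'",
      "lang": "painless"
    }
  },
  "actions": {
    "log_recovery": {
      "logging": {
        "text": "Status went from error back to okay at {{ctx.execution_time}}"
      }
    }
  }
}

The first Watch is just the mirror image: swap the two status checks in the condition.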

In practice, I find myself starting with the first option and quickly building my way into the second option. Frequently I just stop there, but I sometimes find myself wanting a stateful safety-net, which is the third and final option.

Having written most of the cluster alerts, I can tell you that they all take the third option. At Elastic{ON} 2017, I noted that they are just Watches under the covers, and you can actually look at them if you check out the .watches index. For example, here's the cluster status Watch from my local 5.6 cluster (I chopped out the status portion, which is metadata that Watcher itself uses, and shuffled the fields around into the more traditional JSON order):

{
  "metadata": {
    "name": "X-Pack Monitoring: Cluster Status (OjiYuMDJRSaONuhDec5NRg)",
    "xpack": {
      "severity": 2100,
      "cluster_uuid": "OjiYuMDJRSaONuhDec5NRg",
      "version_created": 5050199,
      "watch": "elasticsearch_cluster_status",
      "link": "elasticsearch/indices",
      "alert_index": ".monitoring-alerts-6",
      "type": "monitoring"
    }
  },
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "chain": {
      "inputs": [
        {
          "check": {
            "search": {
              "request": {
                "indices": [
                  ".monitoring-es-*"
                ],
                "body": {
                  "size": 1,
                  "query": {
                    "bool": {
                      "filter": [
                        {
                          "term": {
                            "cluster_uuid": "{{ctx.metadata.xpack.cluster_uuid}}"
                          }
                        },
                        {
                          "bool": {
                            "should": [
                              {
                                "term": {
                                  "_type": "cluster_state"
                                }
                              },
                              {
                                "term": {
                                  "type": "cluster_stats"
                                }
                              }
                            ]
                          }
                        }
                      ]
                    }
                  },
                  "_source": [
                    "cluster_state.status"
                  ],
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ]
                }
              }
            }
          }
        },
        {
          "alert": {
            "search": {
              "request": {
                "indices": [
                  ".monitoring-alerts-6",
                  ".monitoring-alerts-2"
                ],
                "body": {
                  "size": 1,
                  "query": {
                    "bool": {
                      "filter": {
                        "term": {
                          "_id": "{{ctx.watch_id}}"
                        }
                      }
                    }
                  },
                  "terminate_after": 1,
                  "sort": [
                    {
                      "timestamp": {
                        "order": "desc"
                      }
                    }
                  ]
                },
                "search_type": "query_then_fetch"
              }
            }
          }
        }
      ]
    }
  },
  "condition": {
    "script": {
      "source": """
        ctx.vars.fails_check = ctx.payload.check.hits.total != 0 && ctx.payload.check.hits.hits[0]._source.cluster_state.status != 'green';
        ctx.vars.not_resolved = ctx.payload.alert.hits.total == 1 && ctx.payload.alert.hits.hits[0]._source.resolved_timestamp == null;

        return ctx.vars.fails_check || ctx.vars.not_resolved
      """,
      "lang": "painless"
    }
  },
  "transform": {
    "script": {
      "source": """
        def state = 'red';

        if (ctx.vars.fails_check) {
          state = ctx.payload.check.hits.hits[0]._source.cluster_state.status;
        }

        if (ctx.vars.not_resolved) {
          ctx.payload = ctx.payload.alert.hits.hits[0]._source;

          if (ctx.vars.fails_check == false) {
            ctx.payload.resolved_timestamp = ctx.execution_time;
          }
        } else {
          ctx.payload = ['timestamp': ctx.execution_time, 'metadata': ctx.metadata.xpack];
        }

        if (ctx.vars.fails_check) {
          ctx.payload.prefix = 'Elasticsearch cluster status is ' + state + '.';

          if (state == 'red') {
            ctx.payload.message = 'Allocate missing primary shards and replica shards.';
            ctx.payload.metadata.severity = 2100;
          } else {
            ctx.payload.message = 'Allocate missing replica shards.';
            ctx.payload.metadata.severity = 1100;
          }
        }

        ctx.payload.update_timestamp = ctx.execution_time;

        return ctx.payload;
      """,
      "lang": "painless"
    }
  },
  "actions": {
    "trigger_alert": {
      "index": {
        "index": ".monitoring-alerts-6",
        "doc_type": "doc",
        "doc_id": "OjiYuMDJRSaONuhDec5NRg_elasticsearch_cluster_status"
      }
    }
  }
}
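
If you want to pull these up in your own cluster, a plain search against the .watches system index will list every Watch, cluster alerts included (assuming the user you query with is allowed to read that index); each cluster alert identifies itself via metadata.xpack.watch, as you can see above:

GET .watches/_search
{
  "size": 50
}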

If you look at the condition, it detects two things:

  1. ctx.vars.fails_check - Are we currently in an error state?
  2. ctx.vars.not_resolved - Were we previously in an error state? (Based on the previous state from the second chain input, if any existed)

From there, we proceed to do something with this Watch if either of those are true.

This avoids the need to constantly look for "are we okay?" because it's implied and we can therefore assume we're okay if we're not failing. Also, once we answer both of those questions, we can control the flow of all actions. The display then becomes a matter of transforming the checks into whatever we want them to look like. That's where you get the beefy transform script.

In the case of monitoring, we make use of the state both in the UI and in the Watch, which is a convenient win-win, but you are free to email (or send to Slack, HipChat, etc.) a more reader-friendly copy of the state as a secondary action, while also indexing the state for follow-on use.

Using the state allows you to build very robust alerts while also being able to survive unexpected downtime. A classic issue with many alerting tools is that they fail to look for alerts during any time that they were not running. For instance, imagine that Watcher was not running (for any reason) for a day. No Watches would be running during that period and anything that simply "looked back" would traditionally look for the time since the expected last run of the Watch, which tends to be the trigger time (1m in my case). That would stop you from ever catching a transition that occurred during that time frame without maintaining state somewhere. .watcher-history-* inherently does that for you, or you can manage it yourself similarly to how I've shown above.
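
If you go the .watcher-history-* route, the "previous state" can be just one more entry in the chain input's inputs array. Below is a sketch of such an entry; error_to_okay is a placeholder watch id, and you should double-check the history field names (watch_id, result.condition.met, trigger_event.triggered_time) against the history documents in your own version before relying on them:

{
  "previous_run": {
    "search": {
      "request": {
        "indices": [
          ".watcher-history-*"
        ],
        "body": {
          "size": 1,
          "query": {
            "bool": {
              "filter": {
                "term": {
                  "watch_id": "error_to_okay"
                }
              }
            }
          },
          "_source": [
            "result.condition.met"
          ],
          "sort": [
            {
              "trigger_event.triggered_time": {
                "order": "desc"
              }
            }
          ]
        }
      }
    }
  }
}

A script condition can then compare the current check against previous_run to decide whether anything actually transitioned since the last execution, even if that execution happened hours ago.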

... and a little more ...

In X-Pack monitoring 6.0, we have added optional email actions to our cluster alerts. For those, we track transitions so that we can email about them. The exact transitions depend on the alert, but generally speaking you need something akin to is_new and is_resolved. If there are intermediate stages, then you will also want something like is_modified. From there, you just add an action-level condition that triggers only when is_new || is_modified || is_resolved. And voilà, you get actions firing on a per-transition basis, and they can fire after delays too.

{
  "actions": {
    "trigger_alert": {
      "index": {
        "index": ".monitoring-alerts-6",
        "doc_type": "doc",
        "doc_id": "OjiYuMDJRSaONuhDec5NRg_elasticsearch_cluster_status"
      }
    },
    "send_email": {
      "condition": {
        "script": {
          "script": "return ctx.vars.is_new || ctx.vars.is_modified || ctx.vars.is_resolved",
          "lang": "painless"
        }
      },
      "email": {
        "...": "..."
      }
    }
  }
}
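
The is_new / is_modified / is_resolved flags are not magic; they are just more ctx.vars that you set while comparing the current check against the stored alert document. The snippet below is not the exact 6.0 implementation, just a rough sketch of how the flags could be derived in the main condition script, reusing the check and alert chain inputs from the Watch above:

"condition": {
  "script": {
    "source": """
      // Current state, from the first chain input.
      ctx.vars.fails_check = ctx.payload.check.hits.total != 0 && ctx.payload.check.hits.hits[0]._source.cluster_state.status != 'green';

      // Previously indexed alert document, from the second chain input (if any).
      boolean existing = ctx.payload.alert.hits.total == 1;
      boolean resolved = existing && ctx.payload.alert.hits.hits[0]._source.resolved_timestamp != null;

      // Transitions: a brand new problem, or a problem that just cleared.
      ctx.vars.is_new = ctx.vars.fails_check && (existing == false || resolved);
      ctx.vars.is_resolved = ctx.vars.fails_check == false && existing && resolved == false;

      // Run the actions while there is an active alert or one that still needs to be marked resolved.
      return ctx.vars.fails_check || (existing && resolved == false)
    """,
    "lang": "painless"
  }
}

The transform then carries those flags (or fields derived from them) into the indexed document, and the send_email action above only fires when one of them is true, which is exactly the "tell me when it went back to okay" behavior you are after.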

Hope that helps,
Chris


@pickypg thanks, this was very helpful!
