Observability Alerts - Recreate Watcher into Threshold Alert

Hello,

I am trying to recreate this Watcher as a Metric Threshold alert:

{
  "trigger": {
    "schedule": {
      "cron": "0 */1 15-23 ? * MON-SUN"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "aggs": {
            "host": {
              "terms": {
                "field": "host.name",
                "order": {
                  "memory_usage": "desc"
                }
              },
              "aggs": {
                "memory_usage": {
                  "avg": {
                    "field": "system.memory.used.pct"
                  }
                },
                "avg_bucket_filter": {
                  "bucket_selector": {
                    "buckets_path": {
                      "totalAvg": "memory_usage"
                    },
                    "script": "params.totalAvg >= {{ctx.metadata.threshold_min}} && params.totalAvg <= {{ctx.metadata.threshold_max}}"
                  }
                },
                "aggs": {
                  "filters": {
                    "filters": {
                      "history": {
                        "range": {
                          "@timestamp": {
                            "gte": "now-15m",
                            "lte": "now"
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "timeout": "60s",
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-15m",
                      "lte": "now"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.host.buckets.size() > 0",
      "lang": "painless"
    }
  },
  "actions": {
    "email_admin": {
      "throttle_period_in_millis": 3600000,
      "email": {
        "profile": "standard",
        "to": [
          "<email>"
        ],
        "subject": "Outage Alert: Memory used {{#ctx.payload.hosts}} {{memory_usage}}% for {{key}} {{/ctx.payload.hosts}}",
        "body": {
          "html": "<html>  <h1> Alert: High Memory Usage </h1> {{#ctx.payload.hosts}} Reason: {{memory_usage}}% for {{key}} in the last {{ctx.metadata.window_period}}. Alert when between 98% - 100%. <br>  {{/ctx.payload.hosts}} <br> This message was sent by Elastic. <a href='<URL placeholder>'> View rule in Kibana.</a></html>"
        }
      }
    }
  },
  "metadata": {
    "threshold_max": 1,
    "window_period": "15m",
    "threshold_min": 0.98
  },
  "transform": {
    "script": {
      "source": "def threshold_p = ctx.metadata.threshold_min*100; return [ 'threshold': (int)threshold_p, 'hosts': ctx.payload.aggregations.host.buckets.stream().map(p -> [ 'key': p.key, 'memory_usage': (int) (p.memory_usage.value*100)]).collect(Collectors.toList()) ];",
      "lang": "painless"
    }
  }
}

To explain what the watcher alert does:

  • calculates the average memory usage for each minute
  • checks whether that per-minute average is between a minimum and maximum threshold
  • determines whether the per-minute averages stayed within the threshold for the whole 15-minute window
    (to clarify: the average memory usage has to meet the threshold 15 times in a row; see the sketch after this list)
  • if they did, alert on it
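
To make the "consistently within the threshold" step concrete, the same check can be written as a standalone search using a per-minute date_histogram plus min/max pipeline aggregations. This is just a sketch of the logic I am trying to reproduce (same metricbeat-* fields and 98% - 100% range as the watcher), not something the watcher itself runs:

POST metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "per_minute": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
          "aggs": {
            "memory_usage": { "avg": { "field": "system.memory.used.pct" } }
          }
        },
        "min_per_minute": {
          "min_bucket": { "buckets_path": "per_minute>memory_usage" }
        },
        "max_per_minute": {
          "max_bucket": { "buckets_path": "per_minute>memory_usage" }
        },
        "consistently_in_range": {
          "bucket_selector": {
            "buckets_path": {
              "minAvg": "min_per_minute",
              "maxAvg": "max_per_minute"
            },
            "script": "params.minAvg >= 0.98 && params.maxAvg <= 1.0"
          }
        }
      }
    }
  }
}

A host only passes the bucket_selector when even its lowest one-minute average over the window is at or above 98% (and the highest is at or below 100%), i.e. the average was in range for all 15 minutes.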

I made the watcher a while back, and now with the out-of-the-box alerts it seems like they are advanced enough to do the same thing.

I have already attempted to create a metric threshold alert but can't seem to get it right (my attempt so far is sketched after the list below). I think the part I am struggling to translate is this logic:

  • determine whether the 15 one-minute memory averages were consistently within the threshold
  • if they were, alert on it
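
For reference, this is roughly the rule I tried to create through the Kibana alerting API (POST /api/alerting/rule with a kbn-xsrf header). I am writing the rule_type_id and the criteria fields from memory, so the exact parameter names and the threshold format (0.98 vs. 98) may not be right for my stack version:

{
  "name": "High memory usage (98% - 100%)",
  "rule_type_id": "metrics.alert.threshold",
  "consumer": "infrastructure",
  "schedule": { "interval": "1m" },
  "params": {
    "criteria": [
      {
        "aggType": "avg",
        "metric": "system.memory.used.pct",
        "comparator": "between",
        "threshold": [0.98, 1],
        "timeSize": 15,
        "timeUnit": "m"
      }
    ],
    "groupBy": ["host.name"],
    "alertOnNoData": false
  },
  "actions": []
}

As far as I understand it, a single criterion like this averages system.memory.used.pct over the whole 15-minute window, rather than requiring each of the 15 one-minute averages to be in range, and that is the part I cannot work out how to express.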

Thanks,
Erik

Hi Erik,

Your request seems similar to this feature request:

If that's the case, adding your example to this issue as a comment will help prioritization.

Another related feature request:
