Observability Alerts - Recreate Watcher into Threshold Alert

Hello,

I am trying to recreate this Watcher as a Metric Threshold alert:

{
  "trigger": {
    "schedule": {
      "cron": "0 */1 15-23 ? * MON-SUN"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "aggs": {
            "host": {
              "terms": {
                "field": "host.name",
                "order": {
                  "memory_usage": "desc"
                }
              },
              "aggs": {
                "memory_usage": {
                  "avg": {
                    "field": "system.memory.used.pct"
                  }
                },
                "avg_bucket_filter": {
                  "bucket_selector": {
                    "buckets_path": {
                      "totalAvg": "memory_usage"
                    },
                    "script": "params.totalAvg >= {{ctx.metadata.threshold_min}} && params.totalAvg <= {{ctx.metadata.threshold_max}}"
                  }
                },
                "aggs": {
                  "filters": {
                    "filters": {
                      "history": {
                        "range": {
                          "@timestamp": {
                            "gte": "now-15m",
                            "lte": "now"
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "timeout": "60s",
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-15m",
                      "lte": "now"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.host.buckets.size() > 0",
      "lang": "painless"
    }
  },
  "actions": {
    "email_admin": {
      "throttle_period_in_millis": 3600000,
      "email": {
        "profile": "standard",
        "to": [
          "<email>"
        ],
        "subject": "Outage Alert: Memory used {{#ctx.payload.hosts}} {{memory_usage}}% for {{key}} {{/ctx.payload.hosts}}",
        "body": {
          "html": "<html>  <h1> Alert: High Memory Usage </h1> {{#ctx.payload.hosts}} Reason: {{memory_usage}}% for {{key}} in the last {{ctx.metadata.window_period}}. Alert when between 98% - 100%. <br>  {{/ctx.payload.hosts}} <br> This message was sent by Elastic. <a href='<URL placeholder>'> View rule in Kibana.</a></html>"
        }
      }
    }
  },
  "metadata": {
    "threshold_max": 1,
    "window_period": "15m",
    "threshold_min": 0.98
  },
  "transform": {
    "script": {
      "source": "def threshold_p = ctx.metadata.threshold_min*100; return [ 'threshold': (int)threshold_p, 'hosts': ctx.payload.aggregations.host.buckets.stream().map(p -> [ 'key': p.key, 'memory_usage': (int) (p.memory_usage.value*100)]).collect(Collectors.toList()) ];",
      "lang": "painless"
    }
  }
}

To explain what the watcher alert does:

  • calculates the average memory usage for each minute
  • checks whether that per-minute average is between a minimum and maximum threshold
  • determines whether the per-minute averages stayed within the threshold for the whole 15-minute window
    (to clarify: the average memory usage has to meet the threshold 15 times in a row; see the sketch after this list)
  • if they did, alert on it
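
To make the "consistently within the threshold" step concrete, the same check can be written as a standalone search using a per-minute date_histogram plus min/max pipeline aggregations. This is just a sketch of the logic I am trying to reproduce (same metricbeat-* fields and 98% - 100% range as the watcher), not something the watcher itself runs:

POST metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "per_minute": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
          "aggs": {
            "memory_usage": { "avg": { "field": "system.memory.used.pct" } }
          }
        },
        "min_per_minute": {
          "min_bucket": { "buckets_path": "per_minute>memory_usage" }
        },
        "max_per_minute": {
          "max_bucket": { "buckets_path": "per_minute>memory_usage" }
        },
        "consistently_in_range": {
          "bucket_selector": {
            "buckets_path": {
              "minAvg": "min_per_minute",
              "maxAvg": "max_per_minute"
            },
            "script": "params.minAvg >= 0.98 && params.maxAvg <= 1.0"
          }
        }
      }
    }
  }
}

A host only passes the bucket_selector when even its lowest one-minute average over the window is at or above 98% (and the highest is at or below 100%), i.e. the average was in range for all 15 minutes.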

I made the watcher a while back, and now with the out-of-the-box alerts it seems like they are advanced enough to do the same thing.

I have already attempted to create a metric threshold alert but can't seem to get it right (my attempt so far is sketched after the list below). I think the part I am struggling to translate is this logic:

  • determine whether the 15 one-minute memory averages were consistently within the threshold
  • if they were, alert on it
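
For reference, this is roughly the rule I tried to create through the Kibana alerting API (POST /api/alerting/rule with a kbn-xsrf header). I am writing the rule_type_id and the criteria fields from memory, so the exact parameter names and the threshold format (0.98 vs. 98) may not be right for my stack version:

{
  "name": "High memory usage (98% - 100%)",
  "rule_type_id": "metrics.alert.threshold",
  "consumer": "infrastructure",
  "schedule": { "interval": "1m" },
  "params": {
    "criteria": [
      {
        "aggType": "avg",
        "metric": "system.memory.used.pct",
        "comparator": "between",
        "threshold": [0.98, 1],
        "timeSize": 15,
        "timeUnit": "m"
      }
    ],
    "groupBy": ["host.name"],
    "alertOnNoData": false
  },
  "actions": []
}

As far as I understand it, a single criterion like this averages system.memory.used.pct over the whole 15-minute window, rather than requiring each of the 15 one-minute averages to be in range, and that is the part I cannot work out how to express.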

Thanks,
Erik

Hi Erik,

Your request seems similar to this feature request:

If that's the case, adding your example to this issue as a comment will help prioritization.

Another related feature request:
