Hello,
I am trying to recreate this Watcher as a Metric Threshold alert:
{
  "trigger": {
    "schedule": {
      "cron": "0 */1 15-23 ? * MON-SUN"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "aggs": {
            "host": {
              "terms": {
                "field": "host.name",
                "order": {
                  "memory_usage": "desc"
                }
              },
              "aggs": {
                "memory_usage": {
                  "avg": {
                    "field": "system.memory.used.pct"
                  }
                },
                "avg_bucket_filter": {
                  "bucket_selector": {
                    "buckets_path": {
                      "totalAvg": "memory_usage"
                    },
                    "script": "params.totalAvg >= {{ctx.metadata.threshold_min}} && params.totalAvg <= {{ctx.metadata.threshold_max}}"
                  }
                },
                "aggs": {
                  "filters": {
                    "filters": {
                      "history": {
                        "range": {
                          "@timestamp": {
                            "gte": "now-15m",
                            "lte": "now"
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          },
          "timeout": "60s",
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-15m",
                      "lte": "now"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": {
      "source": "return ctx.payload.aggregations.host.buckets.size() > 0",
      "lang": "painless"
    }
  },
  "actions": {
    "email_admin": {
      "throttle_period_in_millis": 3600000,
      "email": {
        "profile": "standard",
        "to": [
          "<email>"
        ],
        "subject": "Outage Alert: Memory used {{#ctx.payload.hosts}} {{memory_usage}}% for {{key}} {{/ctx.payload.hosts}}",
        "body": {
          "html": "<html> <h1> Alert: High Memory Usage </h1> {{#ctx.payload.hosts}} Reason: {{memory_usage}}% for {{key}} in the last {{ctx.metadata.window_period}}. Alert when between 98% - 100%. <br> {{/ctx.payload.hosts}} <br> This message was sent by Elastic. <a href='<URL placeholder>'> View rule in Kibana.</a></html>"
        }
      }
    }
  },
  "metadata": {
    "threshold_max": 1,
    "window_period": "15m",
    "threshold_min": 0.98
  },
  "transform": {
    "script": {
      "source": "def threshold_p = ctx.metadata.threshold_min*100; return [ 'threshold': (int)threshold_p, 'hosts': ctx.payload.aggregations.host.buckets.stream().map(p -> [ 'key': p.key, 'memory_usage': (int) (p.memory_usage.value*100)]).collect(Collectors.toList()) ];",
      "lang": "painless"
    }
  }
}
To explain what the watcher alert does:
- calculates the average memory usage for each minute
- checks whether that average falls within a threshold range
- determines whether the averages for the last 15 minutes were consistently within that range (to clarify: the average memory usage has to be inside the threshold 15 times in a row) - if they were, it alerts (see the query sketch below)
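For reference, this is roughly how that "consistently within the threshold" check could be expressed as a standalone search with one bucket per minute. This is only a sketch: the per_minute and within_threshold aggregation names are mine, while the field names and thresholds come from the watcher above:

POST metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } }
      ]
    }
  },
  "aggs": {
    "host": {
      "terms": { "field": "host.name" },
      "aggs": {
        "per_minute": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
          "aggs": {
            "memory_usage": { "avg": { "field": "system.memory.used.pct" } },
            "within_threshold": {
              "bucket_selector": {
                "buckets_path": { "totalAvg": "memory_usage" },
                "script": "params.totalAvg >= 0.98 && params.totalAvg <= 1.0"
              }
            }
          }
        }
      }
    }
  }
}

The bucket_selector drops any one-minute bucket whose average is outside the range, so a host has been "consistently within the threshold" only when all 15 of its per_minute buckets survive.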
I made the watcher a while back, and now with the out-of-the-box alerts it seems like they are advanced enough to do the same thing.
I have already attempted to create a Metric Threshold alert but can't seem to get it right. I think the part I am struggling to translate is this logic:
- determine whether the last 15 minutes of memory averages were consistently within the threshold
- if they were, alert on it (a Watcher-style sketch of this check follows below)
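In Watcher terms, on top of the per_minute sketch above, that check would be a condition script along these lines (again just a sketch of the logic, not something the Metric Threshold rule exposes directly; it assumes Painless allows Stream.anyMatch, as it does stream/map in my transform):

"condition": {
  "script": {
    "lang": "painless",
    "source": "return ctx.payload.aggregations.host.buckets.stream().anyMatch(h -> h.per_minute.buckets.size() >= 15);"
  }
}

i.e. alert as soon as at least one host kept all 15 one-minute buckets inside the range.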
Thanks,
Erik