I am trying to write a watcher. I've tested the search expression on the console, and it appears to work. When I use "Simulate" within Kibana, it says that the trigger should fire. However, it isn't firing - the UI shows it as not having been triggered.
I have seen the same behavior in both ES / Kibana 7.1.1 and 7.4.0.
The watcher is meant to alert if the average idle CPU on any node of our Kubernetes cluster has been below a threshold for the last 15 minutes. To test the watcher, I've set the threshold to 90% (0.9); production would be much lower. So this should fire if system.cpu.idle.norm.pct, grouped by host.name, averages below 0.9 over the last 15 minutes.
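To make the intended check concrete, here is a rough sketch of the same logic in plain Python, outside Elasticsearch. It only illustrates what I expect the watcher to do, not how Watcher evaluates it; the sample readings and the names samples, THRESHOLD and hosts_below_threshold are made up for the example.

```python
from collections import defaultdict
from statistics import mean

THRESHOLD = 0.9  # test value; production would be much lower

# Hypothetical (host.name, system.cpu.idle.norm.pct) readings from the last 15 minutes
samples = [
    ("node-a", 0.95), ("node-a", 0.97),
    ("node-b", 0.40), ("node-b", 0.60),
]

def hosts_below_threshold(readings, threshold):
    """Group readings by host and return hosts whose average idle CPU is below threshold."""
    by_host = defaultdict(list)
    for host, idle in readings:
        by_host[host].append(idle)
    averages = {host: mean(values) for host, values in by_host.items()}
    return {host: avg for host, avg in averages.items() if avg < threshold}

offenders = hosts_below_threshold(samples, THRESHOLD)
should_fire = len(offenders) > 0  # fire if at least one host is below the threshold
print(offenders, should_fire)
```

With these made-up readings only node-b falls below the threshold, so I would expect the watch to fire.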
Watcher code:
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-15m"
                    }
                  }
                },
                {
                  "term": {
                    "fields.cluster_name": "review"
                  }
                }
              ]
            }
          },
          "aggs": {
            "per_host": {
              "terms": {
                "field": "host.name",
                "size": 30
              },
              "aggs": {
                "avg_cpu_idle": {
                  "avg": {
                    "field": "system.cpu.idle.norm.pct"
                  }
                },
                "cpu_in_use": {
                  "bucket_script": {
                    "buckets_path": {
                      "avg_cpu_idle": "avg_cpu_idle"
                    },
                    "script": "Math.round( (1 - params.avg_cpu_idle) * 1000) / 10.0"
                  }
                },
                "filtered": {
                  "bucket_selector": {
                    "buckets_path": {
                      "idle": "avg_cpu_idle"
                    },
                    "script": "params.idle < 0.9"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.per_host.buckets": {
        "path": "avg_cpu_idle.value",
        "lte": {
          "value": 0.9,
          "quantifier": "some"
        }
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "profile": "standard",
        "to": [
          "my.email@example.com"
        ],
        "subject": "Review Apps: High CPU usage",
        "body": {
          "text": "Environment Review Apps High CPU usage: The following nodes have high CPU over the past 15 minutes: {{#ctx.payload.aggregations.per_host.buckets}}\n\n{{key}}: {{cpu_in_use.value}}%{{/ctx.payload.aggregations.per_host.buckets}}"
        }
      }
    }
  },
  "throttle_period_in_millis": 21600000
}
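For reference, this is how I understand the cpu_in_use script and the array_compare condition, written as a small Python sketch. It is an approximation under my own assumptions, not Watcher's actual implementation: the buckets list is invented sample data shaped like ctx.payload.aggregations.per_host.buckets, and cpu_in_use / some_lte are helper names I made up.

```python
import math

def cpu_in_use(avg_cpu_idle):
    """Mirror Math.round((1 - idle) * 1000) / 10.0: percent in use, one decimal place."""
    # Java's Math.round rounds half up, so floor(x + 0.5) stands in for it here
    # (Python's built-in round() rounds half to even, which would differ).
    return math.floor((1 - avg_cpu_idle) * 1000 + 0.5) / 10.0

def some_lte(buckets, value):
    """Mirror array_compare with quantifier "some": true if any bucket's average is <= value."""
    return any(b["avg_cpu_idle"]["value"] <= value for b in buckets)

# Hypothetical buckets, shaped like the per_host aggregation output
buckets = [
    {"key": "node-a", "avg_cpu_idle": {"value": 0.75}},
    {"key": "node-b", "avg_cpu_idle": {"value": 0.5}},
]

condition_met = some_lte(buckets, 0.9)
for b in buckets:
    # Same fields the email template prints: {{key}}: {{cpu_in_use.value}}%
    print(f"{b['key']}: {cpu_in_use(b['avg_cpu_idle']['value'])}%")
```

Note that the bucket_selector uses a strict comparison (< 0.9) while the condition uses lte (<= 0.9); since the selector has already removed every bucket at or above 0.9, any bucket that survives should satisfy the condition.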
Things that might be related:
- I am using the standard metricbeat kubernetes setup as described at https://www.elastic.co/guide/en/beats/metricbeat/current/running-on-kubernetes.html. This exports data to an index that is used for multiple days and only rolls over on a data limit, so today's data (2019-10-09) is still going into index metricbeat-7.3.2-2019.09.30-000001. I think ES uses some optimization to skip indices that don't relate to the correct date. Could that be the problem?