Watcher - simulate shows it should fire, but it isn't

Stuart_Moore · October 9, 2019, 9:50am

I am trying to write a watcher. I've tested the search expression on the console, and it appears to work. When I use "Simulate" within Kibana, it says that the trigger should fire. However, it isn't firing - the UI shows it as not having been triggered.

I have seen the same behavior in ES / Kibana 7.1.1 and 7.4.0.

The watcher is meant to alert if the average idle CPU on any node of our Kubernetes cluster has been below a threshold for the last 15 minutes. To test the watcher, I've set the threshold to 90% (0.9); production would be much lower. So this should fire if system.cpu.idle.norm.pct, grouped by host.name, averages below 0.9 over the last 15 minutes.

Watcher code:

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-15m"
                    }
                  }
                },
                {
                  "term": {
                    "fields.cluster_name": "review"
                  }
                }
              ]
            }
          },
          "aggs": {
            "per_host": {
              "terms": {
                "field": "host.name",
                "size": 30
              },
              "aggs": {
                "avg_cpu_idle": {
                  "avg": {
                    "field": "system.cpu.idle.norm.pct"
                  }
                },
                "cpu_in_use": {
                  "bucket_script": {
                    "buckets_path": {
                      "avg_cpu_idle": "avg_cpu_idle"
                    },
                    "script": "Math.round( (1 - params.avg_cpu_idle) * 1000) / 10.0"
                  }
                },
                "filtered": {
                  "bucket_selector": {
                    "buckets_path": {
                      "idle": "avg_cpu_idle"
                    },
                    "script": "params.idle < 0.9"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.per_host.buckets": {
        "path": "avg_cpu_idle.value",
        "lte": {
          "value": 0.9,
          "quantifier": "some"
        }
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "profile": "standard",
        "to": [
          "my.email@example.com"
        ],
        "subject": "Review Apps: High CPU usage",
        "body": {
          "text": "Environment Review Apps High CPU usage: The following nodes have high CPU over the past 15 minutes: {{#ctx.payload.aggregations.per_host.buckets}}\n\n{{key}}: {{cpu_in_use.value}}%{{/ctx.payload.aggregations.per_host.buckets}}"
        }
      }
    }
  },
  "throttle_period_in_millis": 21600000
}
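For readers following along, here is a minimal Python sketch of how I understand the array_compare condition above to behave with the "some" quantifier: it is met if at least one bucket has a value at the given path that is less than or equal to the threshold. This is not Watcher's implementation; the function name and the sample buckets (node-a, node-b) are made up for illustration.

```python
def some_lte(buckets, path, threshold):
    """Return True if at least one bucket's value at `path` is <= threshold.

    Mimics "lte" with "quantifier": "some" (illustrative only, not Watcher code).
    """
    def resolve(bucket):
        # Walk a dotted path such as "avg_cpu_idle.value" through nested dicts.
        current = bucket
        for key in path.split("."):
            current = current[key]
        return current

    return any(resolve(bucket) <= threshold for bucket in buckets)


# Hypothetical buckets shaped like the per_host terms aggregation output.
sample_buckets = [
    {"key": "node-a", "avg_cpu_idle": {"value": 0.95}},
    {"key": "node-b", "avg_cpu_idle": {"value": 0.42}},
]

condition_met = some_lte(sample_buckets, "avg_cpu_idle.value", 0.9)
condition_on_empty = some_lte([], "avg_cpu_idle.value", 0.9)
```

With an empty bucket list there is nothing to compare, so the condition cannot be met, whatever the threshold is.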

Things that might be related:

I am using the standard metricbeat kubernetes setup as on https://www.elastic.co/guide/en/beats/metricbeat/current/running-on-kubernetes.html - this exports data to an index that is used for multiple days, and only rolls over on a data limit - so today's data (2019-10-9) is still going into index metricbeat-7.3.2-2019.09.30-000001. I think ES uses some optimization to skip indexes that don't relate to the correct date - could that be the problem?

spinscale · October 9, 2019, 1:14pm

Hey Stuart,

Regarding your question: that will not be the problem, since you are querying all the metricbeat-* indices and the optimization is not based on the index name.

Can you share the output of the execute watch API, as well as the watcher history for this watch?

GET .watcher-history-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "watch_id": "YOUR_WATCH_ID"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "trigger_event.triggered_time": {
        "order": "desc"
      }
    }
  ]
}

This way we can check if the actions have triggered properly.

--Alex

Stuart_Moore · October 9, 2019, 1:32pm

I've shared the top hit below. What I find weirdest is that the payload says there are 0 shards.

{
"_index" : ".watcher-history-10-2019.10.09",
"_type" : "_doc",
"_id" : "review_apps_node_cpu_usage_2ea5423a-4ba5-4c51-bc8c-16d6ee9e79f9-2019-10-09T10:31:47.751492Z",
"_score" : null,
"_source" : {
    "watch_id" : "review_apps_node_cpu_usage",
    "node" : "g3622LczT2aOY6Efv98emQ",
    "state" : "execution_not_needed",
    "user" : "watcher_setup",
    "status" : {
    "state" : {
        "active" : true,
        "timestamp" : "2019-10-08T12:42:15.938Z"
    },
    "last_checked" : "2019-10-09T10:31:47.751Z",
    "actions" : {
        "send_email" : {
        "ack" : {
            "timestamp" : "2019-10-08T12:42:15.938Z",
            "state" : "awaits_successful_execution"
        }
        }
    },
    "execution_state" : "execution_not_needed",
    "version" : -1
    },
    "trigger_event" : {
    "type" : "schedule",
    "triggered_time" : "2019-10-09T10:31:47.751Z",
    "schedule" : {
        "scheduled_time" : "2019-10-09T10:31:47.526Z"
    }
    },
    "input" : {
    "search" : {
        "request" : {
        "search_type" : "query_then_fetch",
        "indices" : [
            "metricbeat-*"
        ],
        "rest_total_hits_as_int" : true,
        "body" : {
            "query" : {
            "bool" : {
                "must" : [
                {
                    "range" : {
                    "@timestamp" : {
                        "gte" : "now-15m"
                    }
                    }
                },
                {
                    "term" : {
                    "fields.cluster_name": "review"
                    }
                }
                ]
            }
            },
            "aggs" : {
            "per_host" : {
                "terms" : {
                "field" : "host.name",
                "size" : 30
                },
                "aggs" : {
                "avg_cpu_idle" : {
                    "avg" : {
                    "field" : "system.cpu.idle.norm.pct"
                    }
                },
                "cpu_in_use" : {
                    "bucket_script" : {
                    "buckets_path" : {
                        "avg_cpu_idle" : "avg_cpu_idle"
                    },
                    "script" : "Math.round( (1 - params.avg_cpu_idle) * 1000) / 10.0"
                    }
                },
                "filtered" : {
                    "bucket_selector" : {
                    "buckets_path" : {
                        "idle" : "avg_cpu_idle"
                    },
                    "script" : "params.idle < 0.9"
                    }
                }
                }
            }
            }
        }
        }
    }
    },
    "condition" : {
    "array_compare" : {
        "ctx.payload.aggregations.per_host.buckets" : {
        "path" : "avg_cpu_idle.value",
        "lte" : {
            "value" : 0.9,
            "quantifier" : "some"
        }
        }
    }
    },
    "result" : {
    "execution_time" : "2019-10-09T10:31:47.751Z",
    "execution_duration" : 1,
    "input" : {
        "type" : "search",
        "status" : "success",
        "payload" : {
        "_shards" : {
            "total" : 0,
            "failed" : 0,
            "successful" : 0,
            "skipped" : 0
        },
        "hits" : {
            "hits" : [ ],
            "total" : 0,
            "max_score" : 0.0
        },
        "took" : 1,
        "timed_out" : false
        },
        "search" : {
        "request" : {
            "search_type" : "query_then_fetch",
            "indices" : [
            "metricbeat-*"
            ],
            "rest_total_hits_as_int" : true,
            "body" : {
            "query" : {
                "bool" : {
                "must" : [
                    {
                    "range" : {
                        "@timestamp" : {
                        "gte" : "now-15m"
                        }
                    }
                    },
                    {
                    "term" : {
                        "fields.cluster_name": "review"
                    }
                    }
                ]
                }
            },
            "aggs" : {
                "per_host" : {
                "terms" : {
                    "field" : "host.name",
                    "size" : 30
                },
                "aggs" : {
                    "avg_cpu_idle" : {
                    "avg" : {
                        "field" : "system.cpu.idle.norm.pct"
                    }
                    },
                    "cpu_in_use" : {
                    "bucket_script" : {
                        "buckets_path" : {
                        "avg_cpu_idle" : "avg_cpu_idle"
                        },
                        "script" : "Math.round( (1 - params.avg_cpu_idle) * 1000) / 10.0"
                    }
                    },
                    "filtered" : {
                    "bucket_selector" : {
                        "buckets_path" : {
                        "idle" : "avg_cpu_idle"
                        },
                        "script" : "params.idle < 0.9"
                    }
                    }
                }
                }
            }
            }
        }
        }
    },
    "condition" : {
        "type" : "array_compare",
        "status" : "success",
        "met" : false,
        "array_compare" : {
        "resolved_values" : {
            "ctx.payload.aggregations.per_host.buckets" : [ ]
        }
        }
    },
    "actions" : [ ]
    },
    "messages" : [ ]
},
"sort" : [
    1570617107751
]
}
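A record this long is easier to read once the few fields that matter are pulled out. The sketch below is a hypothetical helper (not part of any Elastic client) that takes the `_source` of a watcher history hit as a Python dict and reports the state, whether the condition was met, and how many shards the input search touched. The field names are the ones visible in the record above.

```python
def summarize_history_record(source):
    """Pull out the fields that explain why a watch did or did not fire."""
    result = source.get("result", {})
    payload = result.get("input", {}).get("payload", {})
    shards_total = payload.get("_shards", {}).get("total")
    return {
        "state": source.get("state"),
        "condition_met": result.get("condition", {}).get("met"),
        "shards_total": shards_total,
        "hits_total": payload.get("hits", {}).get("total"),
        "actions_run": len(result.get("actions", [])),
        # Zero shards means the search did not reach any index at all.
        "searched_no_shards": shards_total == 0,
    }


# Trimmed copy of the record above, keeping only the fields the helper reads.
record = {
    "state": "execution_not_needed",
    "result": {
        "input": {
            "payload": {
                "_shards": {"total": 0, "failed": 0, "successful": 0, "skipped": 0},
                "hits": {"hits": [], "total": 0, "max_score": 0.0},
            }
        },
        "condition": {"type": "array_compare", "met": False},
        "actions": [],
    },
}

summary = summarize_history_record(record)
```

A `shards_total` of 0 is different from a search that ran and matched nothing: it suggests the scheduled run resolved metricbeat-* to no indices, for example because no index matches the pattern or because the executing user cannot see them.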

Stuart_Moore · October 10, 2019, 11:35am

Apologies - my mistake entirely.
When I was simulating the watcher, it ran as the logged-in user. However, I was creating the watcher via the API (using a script) with a different user that didn't have access to the appropriate indexes. Giving the watcher-creation user access to the metricbeat indexes solves the problem.
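For anyone hitting the same thing, one way to confirm it up front is to ask Elasticsearch whether the user that stores the watch can read the indices, using the has-privileges API (`GET /_security/user/_has_privileges`) while authenticated as that user. The sketch below only builds the request body and interprets a response; the helper names are hypothetical, the "read" privilege is my assumption about what the search input needs, and the sample response is hand-written in the shape I believe the API returns, not captured from a cluster.

```python
def build_privilege_check(index_patterns, privileges=("read",)):
    """Build a request body for GET /_security/user/_has_privileges."""
    return {
        "index": [
            {"names": list(index_patterns), "privileges": list(privileges)}
        ]
    }


def missing_index_privileges(response):
    """List (index_pattern, privilege) pairs the response reports as not granted."""
    missing = []
    for index_name, grants in response.get("index", {}).items():
        for privilege, granted in grants.items():
            if not granted:
                missing.append((index_name, privilege))
    return sorted(missing)


request_body = build_privilege_check(["metricbeat-*"])

# Hand-written example response (assumed shape) for a user without access.
sample_response = {
    "username": "watcher_setup",
    "has_all_requested": False,
    "cluster": {},
    "index": {"metricbeat-*": {"read": False}},
    "application": {},
}

missing = missing_index_privileges(sample_response)
```

Send `request_body` with whatever HTTP client the script already uses, authenticated as the watch-creating user; an empty `missing` list would mean that user can read the indices the watch searches.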

