Watcher - simulate shows it should fire, but it isn't

I am trying to write a watcher. I've tested the search expression in the console, and it appears to work. When I use "Simulate" within Kibana, it says that the trigger should fire. However, the saved watch isn't firing: the UI shows it as never having been triggered.

I have seen the same behavior in ES / Kibana 7.1.1 and 7.4.0.

The specific watcher is meant to alert if the average idle CPU on our Kubernetes cluster has been below a threshold for the last 15 minutes. To test the watcher, I've set the threshold to 90% (0.9); production would be much lower. So this should fire if system.cpu.idle.norm.pct averages below 0.9 over the last 15 minutes, grouped by host.name.

Watcher code:

{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "query": {
            "bool": {
              "must": [
                {
                  "range": {
                    "@timestamp": {
                      "gte": "now-15m"
                    }
                  }
                },
                {
                  "term": {
                    "fields.cluster_name": "review"
                  }
                }
              ]
            }
          },
          "aggs": {
            "per_host": {
              "terms": {
                "field": "host.name",
                "size": 30
              },
              "aggs": {
                "avg_cpu_idle": {
                  "avg": {
                    "field": "system.cpu.idle.norm.pct"
                  }
                },
                "cpu_in_use": {
                  "bucket_script": {
                    "buckets_path": {
                      "avg_cpu_idle": "avg_cpu_idle"
                    },
                    "script": "Math.round( (1 - params.avg_cpu_idle) * 1000) / 10.0"
                  }
                },
                "filtered": {
                  "bucket_selector": {
                    "buckets_path": {
                      "idle": "avg_cpu_idle"
                    },
                    "script": "params.idle < 0.9"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "array_compare": {
      "ctx.payload.aggregations.per_host.buckets": {
        "path": "avg_cpu_idle.value",
        "lte": {
          "value": 0.9,
          "quantifier": "some"
        }
      }
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "profile": "standard",
        "to": [
          "my.email@example.com"
        ],
        "subject": "Review Apps: High CPU usage",
        "body": {
          "text": "Environment Review Apps High CPU usage: The following nodes have high CPU over the past 15 minutes: {{#ctx.payload.aggregations.per_host.buckets}}\n\n{{key}}: {{cpu_in_use.value}}%{{/ctx.payload.aggregations.per_host.buckets}}"
        }
      }
    }
  },
  "throttle_period_in_millis": 21600000
}

Things that might be related:

  • I am using the standard metricbeat kubernetes setup as on https://www.elastic.co/guide/en/beats/metricbeat/current/running-on-kubernetes.html. This exports data to an index that is used for multiple days and only rolls over on a data limit, so today's data (2019-10-09) is still going into index metricbeat-7.3.2-2019.09.30-000001. I think ES uses some optimization to skip indexes that don't relate to the correct date. Could that be the problem?

Hey Stuart,

Regarding your question: that will not be the problem, as you are querying all the metricbeat-* indices and the optimization is not based on the index name.

Can you share the output of the execute watch API, as well as the watcher history for this watch?

GET .watcher-history-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "watch_id": "YOUR_WATCH_ID"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "trigger_event.triggered_time": {
        "order": "desc"
      }
    }
  ]
}

This way we can check if the actions have triggered properly.
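For the execute watch call, here is a minimal Python sketch, assuming a 7.x cluster where the endpoint is `POST _watcher/watch/<watch_id>/_execute`. The base URL and the helper name `build_execute_watch_request` are placeholders, and authentication is left out. It only builds the request; nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request


def build_execute_watch_request(base_url: str, watch_id: str) -> urllib.request.Request:
    """Build (but do not send) an execute watch request for a stored watch."""
    # record_execution=False keeps this manual run out of the watch history.
    body = json.dumps({"record_execution": False}).encode("utf-8")
    # Escape the watch id so unusual characters cannot break the URL path.
    path = f"/_watcher/watch/{urllib.parse.quote(watch_id, safe='')}/_execute"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical local cluster; replace with your own URL and watch id.
    req = build_execute_watch_request("http://localhost:9200", "YOUR_WATCH_ID")
    print(req.get_method(), req.full_url)
```

Add whatever credentials your cluster needs (for example an `Authorization` header) before sending it.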

--Alex

I've shared the top hit below. What I find weirdest is that the payload says there are 0 shards.

{
"_index" : ".watcher-history-10-2019.10.09",
"_type" : "_doc",
"_id" : "review_apps_node_cpu_usage_2ea5423a-4ba5-4c51-bc8c-16d6ee9e79f9-2019-10-09T10:31:47.751492Z",
"_score" : null,
"_source" : {
    "watch_id" : "review_apps_node_cpu_usage",
    "node" : "g3622LczT2aOY6Efv98emQ",
    "state" : "execution_not_needed",
    "user" : "watcher_setup",
    "status" : {
    "state" : {
        "active" : true,
        "timestamp" : "2019-10-08T12:42:15.938Z"
    },
    "last_checked" : "2019-10-09T10:31:47.751Z",
    "actions" : {
        "send_email" : {
        "ack" : {
            "timestamp" : "2019-10-08T12:42:15.938Z",
            "state" : "awaits_successful_execution"
        }
        }
    },
    "execution_state" : "execution_not_needed",
    "version" : -1
    },
    "trigger_event" : {
    "type" : "schedule",
    "triggered_time" : "2019-10-09T10:31:47.751Z",
    "schedule" : {
        "scheduled_time" : "2019-10-09T10:31:47.526Z"
    }
    },
    "input" : {
    "search" : {
        "request" : {
        "search_type" : "query_then_fetch",
        "indices" : [
            "metricbeat-*"
        ],
        "rest_total_hits_as_int" : true,
        "body" : {
            "query" : {
            "bool" : {
                "must" : [
                {
                    "range" : {
                    "@timestamp" : {
                        "gte" : "now-15m"
                    }
                    }
                },
                {
                    "term" : {
                    "fields.cluster_name": "review"
                    }
                }
                ]
            }
            },
            "aggs" : {
            "per_host" : {
                "terms" : {
                "field" : "host.name",
                "size" : 30
                },
                "aggs" : {
                "avg_cpu_idle" : {
                    "avg" : {
                    "field" : "system.cpu.idle.norm.pct"
                    }
                },
                "cpu_in_use" : {
                    "bucket_script" : {
                    "buckets_path" : {
                        "avg_cpu_idle" : "avg_cpu_idle"
                    },
                    "script" : "Math.round( (1 - params.avg_cpu_idle) * 1000) / 10.0"
                    }
                },
                "filtered" : {
                    "bucket_selector" : {
                    "buckets_path" : {
                        "idle" : "avg_cpu_idle"
                    },
                    "script" : "params.idle < 0.9"
                    }
                }
                }
            }
            }
        }
        }
    }
    },
    "condition" : {
    "array_compare" : {
        "ctx.payload.aggregations.per_host.buckets" : {
        "path" : "avg_cpu_idle.value",
        "lte" : {
            "value" : 0.9,
            "quantifier" : "some"
        }
        }
    }
    },
    "result" : {
    "execution_time" : "2019-10-09T10:31:47.751Z",
    "execution_duration" : 1,
    "input" : {
        "type" : "search",
        "status" : "success",
        "payload" : {
        "_shards" : {
            "total" : 0,
            "failed" : 0,
            "successful" : 0,
            "skipped" : 0
        },
        "hits" : {
            "hits" : [ ],
            "total" : 0,
            "max_score" : 0.0
        },
        "took" : 1,
        "timed_out" : false
        },
        "search" : {
        "request" : {
            "search_type" : "query_then_fetch",
            "indices" : [
            "metricbeat-*"
            ],
            "rest_total_hits_as_int" : true,
            "body" : {
            "query" : {
                "bool" : {
                "must" : [
                    {
                    "range" : {
                        "@timestamp" : {
                        "gte" : "now-15m"
                        }
                    }
                    },
                    {
                    "term" : {
                        "fields.cluster_name": "review"
                    }
                    }
                ]
                }
            },
            "aggs" : {
                "per_host" : {
                "terms" : {
                    "field" : "host.name",
                    "size" : 30
                },
                "aggs" : {
                    "avg_cpu_idle" : {
                    "avg" : {
                        "field" : "system.cpu.idle.norm.pct"
                    }
                    },
                    "cpu_in_use" : {
                    "bucket_script" : {
                        "buckets_path" : {
                        "avg_cpu_idle" : "avg_cpu_idle"
                        },
                        "script" : "Math.round( (1 - params.avg_cpu_idle) * 1000) / 10.0"
                    }
                    },
                    "filtered" : {
                    "bucket_selector" : {
                        "buckets_path" : {
                        "idle" : "avg_cpu_idle"
                        },
                        "script" : "params.idle < 0.9"
                    }
                    }
                }
                }
            }
            }
        }
        }
    },
    "condition" : {
        "type" : "array_compare",
        "status" : "success",
        "met" : false,
        "array_compare" : {
        "resolved_values" : {
            "ctx.payload.aggregations.per_host.buckets" : [ ]
        }
        }
    },
    "actions" : [ ]
    },
    "messages" : [ ]
},
"sort" : [
    1570617107751
]
}

Apologies - my mistake entirely.
When I was simulating the watcher, it was using the logged-in user. However, I was creating the watcher via the API (using a script) with a different user that didn't have access to the appropriate indexes. Giving the watcher-creation user access to the metricbeat indexes solves the problem.
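In case it helps anyone who lands here with the same symptom, below is a rough Python sketch that flags it in a watcher history hit: the search input reports success but `_shards.total` is 0. The function name `diagnose_history_record` is made up for illustration, and it assumes the hit has the same shape as the one posted above, with the `user` field (here `watcher_setup`) being the account the watch runs as. Zero shards can have other causes too, such as no index matching the pattern at all, so treat the result as a hint only.

```python
from typing import Any, Dict, Optional


def diagnose_history_record(record: Dict[str, Any]) -> Optional[str]:
    """Return a hint if a watcher history hit looks like a permissions problem.

    `record` is one hit from a `.watcher-history-*` search.
    """
    source = record.get("_source", {})
    result_input = source.get("result", {}).get("input", {})
    shards = result_input.get("payload", {}).get("_shards", {})

    # The search "succeeded" but touched no shards, so the index pattern
    # resolved to nothing for the user the watch was executed as.
    if result_input.get("status") == "success" and shards.get("total") == 0:
        user = source.get("user", "<unknown>")
        return (
            f"search ran on 0 shards as user '{user}': check that this user "
            "can read the indices named in the watch input"
        )
    return None
```

Feed it each hit from the history search earlier in the thread; a `None` result means this particular pattern was not found.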
