Good place to locate a Kibana/Watcher expert?

I'm looking for someone to help us understand and create Watcher alerts for incoming Metricbeat data. Basic alerting such as network, disk, CPU, and memory.

Is there a place in the community for this? Or does anyone have suggestions on outside locations where I could look?

Thanks,
KG

Assuming you have a licence, why not create a support request? That's what I do when I need help...

This was their reply: "It looks like you are asking for assistance with configuring Watcher. As you are a Cloud Standard customer, we are able to assist you with break/fix issues on your ESS cluster. This question appears to be outside of this break/fix scope."

I can't help you find an expert/consultant, but if you start by writing down your issue in a bit more detail, and also check out the examples alerting repo to get up and running, that might be a first step towards a possible solution and more understanding.

And of course, there are Elastic commercial offerings like subscriptions, but I am not sure if that is what you are after.

Hope that helps!

I will post my Watch here. This is running on Elastic Cloud, and I used the repo to get me started.

The purpose of this Watch was to alert when any disk volume is above 80%. It does not fire:

    {
      "trigger": {
        "schedule": {
          "interval": "5m"
        }
      },
      "input": {
        "search": {
          "request": {
            "search_type": "query_then_fetch",
            "indices": [
              "metricbeat-*"
            ],
            "types": [
              "filesystem"
            ],
            "rest_total_hits_as_int": true,
            "body": {
              "aggs": {
                "host": {
                  "terms": {
                    "field": "host.hostname",
                    "order": {
                      "disk_usage": "desc"
                    }
                  },
                  "aggs": {
                    "disk_usage": {
                      "max": {
                        "field": "system.filesystem.used.pct"
                      }
                    }
                  }
                }
              },
              "query": {
                "bool": {
                  "filter": [
                    {
                      "range": {
                        "@timestamp": {
                          "gte": "now-{{ctx.metadata.window_period}}"
                        }
                      }
                    },
                    {
                      "range": {
                        "disk_usage": {
                          "gte": "{{ctx.metadata.threshold}}"
                        }
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      },
      "condition": {
        "compare": {
          "ctx.payload.hits.total": {
            "gt": 0
          }
        }
      },
      "actions": {
        "email_me": {
          "throttle_period_in_millis": 60000,
          "email": {
            "profile": "standard",
            "from": "username@example.org",
            "to": [
              "myemail@myemail.com"
            ],
            "subject": "Disk Full",
            "body": {
              "html": "Some hosts are over {{ctx.payload.threshold}}% utilized:{{#ctx.payload.hosts}}{{disk_usage}}%-{{key}}:{{/ctx.payload.hosts}}"
            }
          }
        },
        "log": {
          "logging": {
            "level": "info",
            "text": "Some hosts are over {{ctx.payload.threshold}}% utilized:{{#ctx.payload.hosts}}{{disk_usage}}%-{{key}}:{{/ctx.payload.hosts}}"
          }
        }
      },
      "metadata": {
        "window_period": "15m",
        "threshold": 0.8
      },
      "transform": {
        "search": {
          "request": {
            "search_type": "query_then_fetch",
            "indices": [
              "log-events"
            ],
            "rest_total_hits_as_int": true,
            "body": {
              "query": {
                "match": {
                  "status": "error"
                }
              }
            }
          }
        }
      }
    }

The output of the Execute Watch API would help a lot here.

Also, this blog post I wrote a few years back is still valid and should help you get into a fast write/debug loop when dealing with watches, allowing you to easily figure out when something is wrong.
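For reference, a watch can be run on demand in Kibana Dev Tools with the Execute Watch API. This is just a sketch: `my_watch` is a placeholder id, and the two flags shown here (forcing action evaluation, and not persisting the run to the watch history) are optional:

```
POST _watcher/watch/my_watch/_execute
{
  "ignore_condition": true,
  "record_execution": false
}
```

You can also pass a full watch definition inline under a `"watch"` key in the request body, which lets you iterate on a watch without saving it between attempts.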

Thanks for that, reading through your blog post now. Here is the result from execute API:

#! Deprecation: [types removal] Specifying types in a watcher search request is deprecated.
{
  "_id" : "531b40c2-4e64-4a83-9eb4-989cd465f474_d31688ac-7e56-40ab-a5a8-07481c5b318f-2020-03-17T17:40:52.556865Z",
  "watch_record" : {
    "watch_id" : "531b40c2-4e64-4a83-9eb4-989cd465f474",
    "node" : "w88m1bFtSryU64LkC72IdA",
    "state" : "execution_not_needed",
    "user" : "elastic",
    "status" : {
      "state" : {
        "active" : true,
        "timestamp" : "2020-03-09T17:38:15.946Z"
      },
      "last_checked" : "2020-03-17T17:40:52.556Z",
      "actions" : {
        "email_me" : {
          "ack" : {
            "timestamp" : "2020-03-09T17:38:15.946Z",
            "state" : "awaits_successful_execution"
          }
        },
        "log" : {
          "ack" : {
            "timestamp" : "2020-03-09T17:38:15.946Z",
            "state" : "awaits_successful_execution"
          }
        }
      },
      "execution_state" : "execution_not_needed",
      "version" : 2354
    },
    "trigger_event" : {
      "type" : "manual",
      "triggered_time" : "2020-03-17T17:40:52.556Z",
      "manual" : {
        "schedule" : {
          "scheduled_time" : "2020-03-17T17:40:52.556Z"
        }
      }
    },
    "input" : {
      "search" : {
        "request" : {
          "search_type" : "query_then_fetch",
          "indices" : [
            "metricbeat-*"
          ],
          "types" : [
            "filesystem"
          ],
          "rest_total_hits_as_int" : true,
          "body" : {
            "aggs" : {
              "host" : {
                "terms" : {
                  "field" : "host.hostname",
                  "order" : {
                    "disk_usage" : "desc"
                  }
                },
                "aggs" : {
                  "disk_usage" : {
                    "max" : {
                      "field" : "system.filesystem.used.pct"
                    }
                  }
                }
              }
            },
            "query" : {
              "bool" : {
                "filter" : [
                  {
                    "range" : {
                      "@timestamp" : {
                        "gte" : "now-{{ctx.metadata.window_period}}"
                      }
                    }
                  },
                  {
                    "range" : {
                      "disk_usage" : {
                        "gte" : "{{ctx.metadata.threshold}}"
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      }
    },
    "condition" : {
      "compare" : {
        "ctx.payload.hits.total" : {
          "gt" : 0
        }
      }
    },
    "metadata" : {
      "window_period" : "15m",
      "name" : "Disk Used Test - Karn",
      "threshold" : 0.8,
      "xpack" : {
        "type" : "json"
      }
    },
    "result" : {
      "execution_time" : "2020-03-17T17:40:52.556Z",
      "execution_duration" : 21,
      "input" : {
        "type" : "search",
        "status" : "success",
        "payload" : {
          "_shards" : {
            "total" : 46,
            "failed" : 0,
            "successful" : 46,
            "skipped" : 0
          },
          "hits" : {
            "hits" : [ ],
            "total" : 0,
            "max_score" : null
          },
          "took" : 20,
          "timed_out" : false,
          "aggregations" : {
            "host" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [ ]
            }
          }
        },
        "search" : {
          "request" : {
            "search_type" : "query_then_fetch",
            "indices" : [
              "metricbeat-*"
            ],
            "types" : [
              "filesystem"
            ],
            "rest_total_hits_as_int" : true,
            "body" : {
              "aggs" : {
                "host" : {
                  "terms" : {
                    "field" : "host.hostname",
                    "order" : {
                      "disk_usage" : "desc"
                    }
                  },
                  "aggs" : {
                    "disk_usage" : {
                      "max" : {
                        "field" : "system.filesystem.used.pct"
                      }
                    }
                  }
                }
              },
              "query" : {
                "bool" : {
                  "filter" : [
                    {
                      "range" : {
                        "@timestamp" : {
                          "gte" : "now-15m"
                        }
                      }
                    },
                    {
                      "range" : {
                        "disk_usage" : {
                          "gte" : "0.8"
                        }
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      },
      "condition" : {
        "type" : "compare",
        "status" : "success",
        "met" : false,
        "compare" : {
          "resolved_values" : {
            "ctx.payload.hits.total" : 0
          }
        }
      },
      "actions" : [ ]
    },
    "messages" : [ ]
  }
}

So, the interesting part here is result.input.payload, which contains the search response. It shows that no hits were found (hits.total is 0). This means the condition is false, and thus nothing is triggered.

Have you tried extracting the query from the watch and rewriting it until it matches documents? For example, depending on the Elasticsearch version there is no need for types anymore.
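One likely culprit: the second range filter in the original watch targets `disk_usage`, which is the name of the max aggregation rather than a document field, so it can never match anything. Running the input search standalone in Dev Tools makes this easy to verify; a sketch against the actual field (using the same window and threshold the watch resolves to):

```
GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-15m" } } },
        { "range": { "system.filesystem.used.pct": { "gte": 0.8 } } }
      ]
    }
  }
}
```

If this returns hits directly but the watch still reports zero, the problem is in the watch definition rather than the data.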

Alexander,

Thank you for all of your help. I spent a few hours going between your blog and my watch. I managed to re-craft the watch completely, and your advice on how to speed up testing and development was a godsend. For those coming along behind, and for any critique or comments, here is my watch now, which works correctly:

{
  "trigger": {
    "schedule": {
      "interval": "12h"
    }
  },
  "input": {
    "search": {
      "request": {
        "search_type": "query_then_fetch",
        "indices": [
          "metricbeat-*"
        ],
        "rest_total_hits_as_int": true,
        "body": {
          "size": 0,
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "@timestamp": {
                      "from": "now-15m"
                    }
                  }
                },
                {
                  "range": {
                    "system.filesystem.used.pct": {
                      "from": 0.85
                    }
                  }
                }
              ],
              "must": [
                {
                  "match_phrase": {
                    "system.filesystem.mount_point": "/cmdb"
                  }
                }
              ]
            }
          },
          "aggs": {
            "by_host": {
              "terms": {
                "field": "host.hostname",
                "size": "100"
              }
            },
            "by_disk": {
              "terms": {
                "field": "system.filesystem.mount_point",
                "size": "100"
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 0
      }
    }
  },
  "actions": {
    "email_1": {
      "email": {
        "profile": "standard",
        "to": [
          "nobody@nowhere.com"
        ],
        "subject": "CMDB is above 85% on | {{#ctx.payload.aggregations.by_host.buckets}}{{key}} |{{/ctx.payload.aggregations.by_host.buckets}}.",
        "body": {
          "text": "CMDB is above 85% on | {{#ctx.payload.aggregations.by_host.buckets}}{{key}} |{{/ctx.payload.aggregations.by_host.buckets}}."
        }
      }
    }
  }
}

Glad you got it working! Thanks for digging through all of this!