Alert When - system_fails_to_provide_data - dynamic group of sending hosts

I have a use case to alert when a host fails to send logs. There is a watcher configured here, which is similar to what I'm trying to achieve:

I like the logic: aggregate hosts over the last 24 hours, then check which of them also appear in the last 5 minutes.

However, when I try to modify it for filebeat-*, I get an error.
Is using watcher the best form of alerting, or is there a simpler/more elegant approach?

I've moved this to the Logs category; it was posted to Uptime, which covers a different area.

Thanks sir. If I can be any more clear in the ask, I'll be happy to elaborate.

I'm looking to be notified when any one of N hosts stops sending logs, where N may be 2k-3k.

Hi @bbek,

There's now alerting functionality built into Kibana that's intended to be simpler to work with than Watcher. You could try to implement this using the "Log threshold" alert type. The limitation is that it provides less control over the specific semantics of the condition.

Regarding the error message you're seeing, it seems that the condition script from the example repo hasn't been added to the cluster.

Understood now! Appreciate the clarity.

I'm making progress toward fetching the "observer.name" values (the syslog sending hosts), and it appears the script is returning results.

I need some assistance in formatting and sharpening the output.

How do we extract the 'observer.name' so that I can customize an alert upon a "match" of it (the alerter approach is one way)

OR

Send an email from Watcher that includes the "flatlined" sending 'observer.name' when the conditions match?

I realize {{ctx.payload.hits}} is not the field I need, but it's the location where I'd like the 'observer.name' to be. Is it '_source.observer.name'?

//
"text": "Systems not responding in the last {{ctx.metadata.last_period}} minutes:{{#ctx.payload._value}}{{.}}:{{/ctx.payload._value}} {{ctx.payload.hits}}"
//

The watcher json:

{
  "metadata": {
    "window_period": "24h",
    "last_period":"5m"
  },
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": "filebeat-*",
        "body": {
          "query": {
            "range": {
              "@timestamp": {
                "gte": "now-{{ctx.metadata.window_period}}"
              }
            }
          },
          "aggs": {
            "periods": {
              "filters": {
                "filters": {
                  "history": {
                    "range": {
                      "@timestamp": {
                        "gte": "now-{{ctx.metadata.window_period}}"
                      }
                    }
                  },
                  "last_period": {
                    "range": {
                      "@timestamp": {
                        "gte": "now-{{ctx.metadata.last_period}}"
                      }
                    }
                  }
                }
              },
              "aggs": {
                "hosts": {
                  "terms": {
                    "field": "observer.name",
                    "size": 10000
                  }
                }
              }
            }
          },
          "size": 0
        }
      }
    }
  },
  "condition": {
    "script": "return true"
  },
  "throttle_period": "5m",
  "actions": {
    "log": {
      "transform": {
        "script": "return true"
      },
      "logging": {
        "text": "Systems not responding in the last {{ctx.metadata.last_period}} minutes:{{#ctx.payload._value}}{{.}}:{{/ctx.payload._value}} {{ctx.payload.hits}}"
      }
    }
  }
}

Simulated Results:

> {
>   "watch_id": "_inlined_",
>   "node": "_KlItRP8Qp27WSYwbMYzsw",
>   "state": "executed",
>   "user": "elastic",
>   "status": {
>     "state": {
>       "active": true,
>       "timestamp": "2021-03-08T20:49:58.502Z"
>     },
>     "last_checked": "2021-03-08T20:49:58.503Z",
>     "last_met_condition": "2021-03-08T20:49:58.503Z",
>     "actions": {
>       "log": {
>         "ack": {
>           "timestamp": "2021-03-08T20:49:58.503Z",
>           "state": "ackable"
>         },
>         "last_execution": {
>           "timestamp": "2021-03-08T20:49:58.503Z",
>           "successful": true
>         },
>         "last_successful_execution": {
>           "timestamp": "2021-03-08T20:49:58.503Z",
>           "successful": true
>         }
>       }
>     },
>     "execution_state": "executed",
>     "version": -1
>   },
>   "trigger_event": {
>     "type": "manual",
>     "triggered_time": "2021-03-08T20:49:58.503Z",
>     "manual": {
>       "schedule": {
>         "scheduled_time": "2021-03-08T20:49:58.503Z"
>       }
>     }
>   },
>   "input": {
>     "search": {
>       "request": {
>         "search_type": "query_then_fetch",
>         "indices": [
>           "filebeat-*"
>         ],
>         "rest_total_hits_as_int": true,
>         "body": {
>           "query": {
>             "range": {
>               "@timestamp": {
>                 "gte": "now-{{ctx.metadata.window_period}}"
>               }
>             }
>           },
>           "aggs": {
>             "periods": {
>               "filters": {
>                 "filters": {
>                   "history": {
>                     "range": {
>                       "@timestamp": {
>                         "gte": "now-{{ctx.metadata.window_period}}"
>                       }
>                     }
>                   },
>                   "last_period": {
>                     "range": {
>                       "@timestamp": {
>                         "gte": "now-{{ctx.metadata.last_period}}"
>                       }
>                     }
>                   }
>                 }
>               },
>               "aggs": {
>                 "hosts": {
>                   "terms": {
>                     "field": "observer.name",
>                     "size": 10000
>                   }
>                 }
>               }
>             }
>           },
>           "size": 0
>         }
>       }
>     }
>   },
>   "condition": {
>     "script": {
>       "source": "return true",
>       "lang": "painless"
>     }
>   },
>   "metadata": {
>     "last_period": "5m",
>     "window_period": "24h",
>     "name": "testtest",
>     "xpack": {
>       "type": "json"
>     }
>   },
>   "result": {
>     "execution_time": "2021-03-08T20:49:58.503Z",
>     "execution_duration": 472,
>     "input": {
>       "type": "search",
>       "status": "success",
>       "payload": {
>         "_shards": {
>           "total": 53,
>           "failed": 0,
>           "successful": 53,
>           "skipped": 49
>         },
>         "hits": {
>           "hits": [],
>           "total": 10000,
>           "max_score": null
>         },
>         "took": 470,
>         "timed_out": false,
>         "aggregations": {
>           "periods": {
>             "buckets": {
>               "last_period": {
>                 "doc_count": 42242,
>                 "hosts": {
>                   "doc_count_error_upper_bound": 0,
>                   "sum_other_doc_count": 0,
>                   "buckets": [
>                     {
>                       "doc_count": 3619,
>                       "key": "FortiGate-60F"
>                     },
>                     {
>                       "doc_count": 3524,
>                       "key": "NC-Firewall"
>                     },
>                     {
>                       "doc_count": 982,
>                       "key": "TexasFG"
>                     },
>                     {
>                       "doc_count": 4,
>                       "key": "ok-int-wall2600"
>                     }
>                   ]
>                 }
>               },
>               "history": {
>                 "doc_count": 7415336,
>                 "hosts": {
>                   "doc_count_error_upper_bound": 0,
>                   "sum_other_doc_count": 0,
>                   "buckets": [
>                     {
>                       "doc_count": 519864,
>                       "key": "FortiGate-60F"
>                     },
>                     {
>                       "doc_count": 400606,
>                       "key": "NC-Firewall"
>                     },
>                     {
>                       "doc_count": 148454,
>                       "key": "TexasFG"
>                     },
>                     {
>                       "doc_count": 1176,
>                       "key": "ok-int-wall2600"
>                     }
>                   ]
>                 }
>               }
>             }
>           }
>         }
>       },
>       "search": {
>         "request": {
>           "search_type": "query_then_fetch",
>           "indices": [
>             "filebeat-*"
>           ],
>           "rest_total_hits_as_int": true,
>           "body": {
>             "query": {
>               "range": {
>                 "@timestamp": {
>                   "gte": "now-24h"
>                 }
>               }
>             },
>             "aggs": {
>               "periods": {
>                 "filters": {
>                   "filters": {
>                     "history": {
>                       "range": {
>                         "@timestamp": {
>                           "gte": "now-24h"
>                         }
>                       }
>                     },
>                     "last_period": {
>                       "range": {
>                         "@timestamp": {
>                           "gte": "now-5m"
>                         }
>                       }
>                     }
>                   }
>                 },
>                 "aggs": {
>                   "hosts": {
>                     "terms": {
>                       "field": "observer.name",
>                       "size": 10000
>                     }
>                   }
>                 }
>               }
>             },
>             "size": 0
>           }
>         }
>       }
>     },
>     "condition": {
>       "type": "script",
>       "status": "success",
>       "met": true
>     },
>     "actions": [
>       {
>         "id": "log",
>         "type": "logging",
>         "status": "simulated",
>         "transform": {
>           "type": "script",
>           "status": "success",
>           "payload": {
>             "_value": true
>           }
>         },
>         "logging": {
>           "logged_text": "Systems not responding in the last 5m minutes:true: "
>         }
>       }
>     ]
>   },
>   "messages": []
> }

The top-level condition and action-level transform seem to play the following roles in this watch:

  • The condition compares the length of the current period's host list with the length of the reference period's host list and triggers the action when the current list is shorter. With your change to the "hosts" terms agg, a distinct host is now identified by its observer.name field. Since the example script only refers to the aggregation names, it shouldn't require any change to work.
  • The action-level transform extracts the list of "missing" host identifiers from the response to make it available in the message. Since it only refers to the key of each terms agg bucket, it shouldn't require any change and should return a list of observer.name values.
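
To make those two roles concrete, here is a minimal sketch of what the condition and the action-level transform might look like in your watch. It assumes the aggregation names from your watch (periods, history, last_period, hosts). The Painless sources are my own reconstruction of the logic described above, not a verbatim copy of the scripts in the example repo, so please treat them as a starting point:

```json
{
  "condition": {
    "script": {
      "lang": "painless",
      "source": "/* sketch (assumed logic): met when fewer distinct hosts reported in last_period than in history */ return ctx.payload.aggregations.periods.buckets.history.hosts.buckets.size() > ctx.payload.aggregations.periods.buckets.last_period.hosts.buckets.size();"
    }
  },
  "actions": {
    "log": {
      "transform": {
        "script": {
          "lang": "painless",
          "source": "/* sketch (assumed logic): bucket keys present in history but absent from last_period */ def recent = ctx.payload.aggregations.periods.buckets.last_period.hosts.buckets.stream().map(b -> b.key).collect(Collectors.toList()); return ctx.payload.aggregations.periods.buckets.history.hosts.buckets.stream().map(b -> b.key).filter(k -> !recent.contains(k)).collect(Collectors.toList());"
        }
      },
      "logging": {
        "text": "Systems not responding in the last {{ctx.metadata.last_period}}: {{#ctx.payload._value}}{{.}} {{/ctx.payload._value}}"
      }
    }
  }
}
```

Because the search uses "size": 0, ctx.payload.hits.hits stays empty, so the host names are only available as the key of each terms bucket rather than under _source. A transform that returns a list is exposed as ctx.payload._value, which is what the {{#ctx.payload._value}} loop iterates over. Comparing the two list lengths is enough here because the 24h window contains the 5m window: every host in last_period also appears in history, so a shorter current list means at least one host has gone quiet.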

Given this, I wonder why it wouldn't work with the scripts included in the example. Can you try them and tell us the (simulated) results you got?
