Monitor/alert on dropped Beats events

Hi,
I need to be alerted when Filebeat drops events - even better, alerted on the kind of error that led to the drop. I'm running Filebeat 7.15 on RHEL 7. The first tack I tried was the internal monitoring system (the cluster-monitoring app), but it looks like I cannot create alert rules for Beats in that section. Secondly, I tried to add the .monitoring-beats* index pattern to Metrics in order to create an alert in the Metrics app, but no joy (the beats.* metrics don't show up in the Metrics app). I'm a bit at a loss for the moment. Has anyone else ever tried something like this? (Maybe ingest the Filebeat log with Filebeat - inception alert? - and try creating a log alert?)
Any hints highly appreciated.
Best regards,
Serge

Hi,

Have you tried Kibana Alerting? (Kibana Alerting: Alerts & Actions for Elasticsearch data | Elastic)

Yes, I tried that one, but the problem with the alerting engine is that I cannot just alert on anything I like.
If I want to evaluate rates/derivatives etc. I can only do that in the Metrics part, and therefore I would have to have the .monitoring-beats* index pattern in my Metrics application/section (which doesn't work - the metrics don't show up in the explorer).

The alert on a plain Elasticsearch query only ever alerts on the number of hits, not on the numerical result of an aggregation.
The "index threshold" alert can't do rate or derivative aggregations, and "anomaly detection" is not reliable enough.

I'm wondering, am I really the first one who wants to know whether their Beats lose events? There must be more installations where the event transport is critical and is monitored in some way.

Correct - Kibana Alerting doesn't have the high level of flexibility that Watcher gives (at the cost of Watcher being more complex).

Hi @smueller, welcome to the community.

I think I have a solution for you... but let me triple-check and I will get back to you later.

EDIT: Hmmm, no, I could not get what I wanted to work... seems like you are correct... odd, indeed.

And no, you are definitely not the first... the monitoring data does show the Beats stats (acked, dropped, errors, etc.), but yes, the corresponding alerts seem to be missing.
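
If it helps while you experiment: the counters are there in the monitoring documents. Something like this (untested, field names taken from the standard Beats monitoring mapping) returns the latest acked/dropped totals for a Filebeat instance:

GET .monitoring-beats*/_search
{
  "size": 1,
  "sort": [ { "timestamp": "desc" } ],
  "_source": [
    "beats_stats.beat.host",
    "beats_stats.metrics.libbeat.output.events.acked",
    "beats_stats.metrics.libbeat.output.events.dropped"
  ],
  "query": {
    "match_phrase": { "beats_stats.beat.type": "filebeat" }
  }
}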

Thanks for double-checking, I was starting to wonder about my state of mind.
This whole Beats monitoring is a bit strange. The CPU usage data is different from other process monitoring data (I can't even replicate the CPU usage graph from the cluster monitoring screen),
and I have little to no way to use the alerting engine for rates.

Complexity is not a problem as long as it's logical. I'll try complex watches with the JSON setup and report back - thanks for the hint.

Ok cool - here are some examples that may help your understanding!
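
The shape of a watch is always the same four blocks - trigger, input, condition, actions. A bare-bones sketch (index name, interval and message are just placeholders):

POST /_watcher/watch/minimal_example
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "my-index" ],
        "body": { "query": { "match_all": {} } }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "log": {
      "logging": { "text": "{{ctx.payload.hits.total}} documents found" }
    }
  }
}

Swap the query for one against .monitoring-beats* and adjust the condition/actions to whatever logic you need.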


Well, thanks again - that gave me the push in the right direction. It took me a few wrong turns, but I ended up with this:
Look back two minutes in the Beats monitoring index, bucket-aggregate by hostname, return stats on the events-dropped counter, subtract the minimum counter value from the maximum, and you have the number of dropped events per host for those two minutes.

POST /_watcher/watch/filebeat_dropped_events
{
  "trigger": {
    "schedule": {
      "interval": "2m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          ".monitoring-beats*"
        ],
        "body": {
          "query": {
            "bool": {
              "filter": [
                {
                  "range": {
                    "timestamp": {
                      "gte": "now-2m",
                      "lte": "now"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "beats_stats.beat.type": "filebeat"
                  }
                }
              ]
            }
          },
          "size": 0,
          "aggs": {
            "hosts": {
              "composite": {
                "sources": [
                  {
                    "host": {
                      "terms": {
                        "field": "beats_stats.beat.host"
                      }
                    }
                  }
                ]
              },
              "aggs": {
                "event_stats": {
                  "stats": {
                    "field": "beats_stats.metrics.libbeat.output.events.dropped"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "script": """
      // keep only the hosts whose dropped-events counter changed within the window
      def dropping = ctx.payload.aggregations.hosts.buckets.stream().filter( el -> el.event_stats.max != el.event_stats.min ).collect(Collectors.toList());
      return dropping.size() > 0;
    """
  },
  "actions": {
    "log": {
      "transform": {
        "script": """
          // one entry per host: hostname and number of events dropped in the window (counter max - min)
          def out = ctx.payload.aggregations.hosts.buckets.stream().filter( el -> el.event_stats.max != el.event_stats.min ).map( el -> ['name': el.key.host, 'dropped': el.event_stats.max - el.event_stats.min ] ).collect(Collectors.toList());
          return out;
        """
      },
      "logging": {
        "text": "{{#ctx.payload._value}}host {{name}} lost {{dropped}} events. {{/ctx.payload._value}}"
      }
    }
  }
}

In the production setup it also sends chat messages via a webhook action, but for demonstration purposes logging is sufficient.
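
For completeness, the webhook action looks roughly like this - host, path and message body are placeholders here, not my real chat endpoint, and it reuses the same transform script as the log action (omitted for brevity):

    "notify_chat": {
      "webhook": {
        "scheme": "https",
        "method": "POST",
        "host": "chat.example.org",
        "port": 443,
        "path": "/hooks/placeholder",
        "headers": {
          "Content-Type": "application/json"
        },
        "body": "{\"text\": \"{{#ctx.payload._value}}host {{name}} lost {{dropped}} events. {{/ctx.payload._value}}\"}"
      }
    }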

The condition checks for inequality because I have no need for thresholds.
The count calculation for the alert is a bit dangerous: it does not account for counter resets or overflows. Any suggestions for that, or other criticism, are highly appreciated.
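
One idea I'm considering for the reset problem (untested sketch, along the lines of the derivative query earlier in the thread): replace the stats aggregation in the watch input with a per-host date_histogram, take the max of the counter per sub-interval, and put a derivative on top. A counter reset (Filebeat restart) then shows up as a single negative derivative value that can be discarded, e.g. with a bucket_selector, instead of producing one huge bogus diff:

  "aggs": {
    "hosts": {
      "terms": { "field": "beats_stats.beat.host" },
      "aggs": {
        "per_minute": {
          "date_histogram": { "field": "timestamp", "fixed_interval": "1m" },
          "aggs": {
            "dropped_counter": { "max": { "field": "beats_stats.metrics.libbeat.output.events.dropped" } },
            "dropped_delta": { "derivative": { "buckets_path": "dropped_counter" } },
            "positive_only": {
              "bucket_selector": {
                "buckets_path": { "delta": "dropped_delta" },
                "script": "params.delta != null && params.delta >= 0"
              }
            }
          }
        }
      }
    }
  }

It would still lose whatever was dropped right around the restart, but at least it wouldn't fire with a nonsense number.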

All the best and stay safe,
Serge
