How to "drill down" through Filebeat monitoring to identify emergent errors or trends?

We are currently doing a trial of Elastic Cloud and we're using Filebeat to gather logs from a few hosts. At the moment we're taking some simple performance measurements and (generally) things look positive.

In our production scenario, we'll have about 300 hosts, each running Filebeat (tailing 1 to 3 log files each) and sending data to Elasticsearch.

Suppose that one or more of these 300 Filebeat instances stops sending data, encounters output errors, or otherwise misbehaves. Ideally, we'd have alerts set up to flag such problems almost immediately. At a minimum, we'd like some visualizations and dashboards that start from aggregated data and let us sift through it in a minute or two.

I don't see a pre-configured way to, for example, look at aggregate output errors across all (or a subset of) Filebeat instances, or to investigate idle or failed instances that aren't publishing events at all.

Has anyone else dealt with this challenge? If so, how did you attack it? My naive instinct is to build some custom visualizations by looking at the .monitoring-beats indices, examining the fields reported in there, and doing some queries accordingly.
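To make that concrete, here's the kind of query I have in mind: a rough sketch in Python against the `.monitoring-beats-*` indices, assuming monitoring collection is writing `beats_stats` documents there. The exact field names may differ by stack version, and the endpoint and credentials below are placeholders.

```python
# Rough sketch: find Filebeat instances that have gone quiet by checking when
# each beat last reported into the monitoring indices. Field names
# (beats_stats.beat.name, timestamp) are what I see in our .monitoring-beats-*
# documents, but may vary by stack version.
import datetime

import requests

ES_URL = "https://my-cluster.example.com:9243"   # hypothetical cluster endpoint
AUTH = ("elastic", "changeme")                    # hypothetical credentials
STALE_AFTER_MINUTES = 10

query = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "per_beat": {
            "terms": {"field": "beats_stats.beat.name", "size": 500},
            "aggs": {"last_seen": {"max": {"field": "timestamp"}}},
        }
    },
}

resp = requests.post(f"{ES_URL}/.monitoring-beats-*/_search", json=query, auth=AUTH)
resp.raise_for_status()

now = datetime.datetime.now(datetime.timezone.utc)
for bucket in resp.json()["aggregations"]["per_beat"]["buckets"]:
    # max agg on a date field returns epoch milliseconds in "value"
    last_seen = datetime.datetime.fromtimestamp(
        bucket["last_seen"]["value"] / 1000, tz=datetime.timezone.utc
    )
    idle_minutes = (now - last_seen).total_seconds() / 60
    if idle_minutes > STALE_AFTER_MINUTES:
        print(f"{bucket['key']}: no stats for {idle_minutes:.0f} minutes")
```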

There are a few ways to do this.

The first would be to set up some Machine Learning jobs on the logs to watch for unusual patterns (errors, rate changes, etc.). From there you can also easily enable Alerting.
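For instance, here's a minimal sketch of such a job: a `low_count` detector partitioned by host, which flags hosts whose event rate drops unexpectedly, created through the anomaly detection job API. The job/datafeed names, index pattern, and `host.name` field are assumptions; adjust them to your own data.

```python
# Minimal sketch of an anomaly detection job that flags unusual drops in the
# per-host event rate. Names, index pattern, and field names are assumptions.
import requests

ES_URL = "https://my-cluster.example.com:9243"   # hypothetical cluster endpoint
AUTH = ("elastic", "changeme")                    # hypothetical credentials

job = {
    "description": "Per-host Filebeat event rate (flags hosts that go quiet)",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
            {
                # low_count flags buckets where the event count is unusually low
                "function": "low_count",
                "partition_field_name": "host.name",
                "detector_description": "low event count per host",
            }
        ],
        "influencers": ["host.name"],
    },
    "data_description": {"time_field": "@timestamp"},
}
requests.put(f"{ES_URL}/_ml/anomaly_detectors/filebeat-host-rate", json=job, auth=AUTH)

# The datafeed connects the job to the Filebeat data
datafeed = {
    "job_id": "filebeat-host-rate",
    "indices": ["filebeat-*"],
    "query": {"match_all": {}},
}
requests.put(
    f"{ES_URL}/_ml/datafeeds/datafeed-filebeat-host-rate", json=datafeed, auth=AUTH
)
```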

The second would be to set up simple thresholds and then do Alerting on those.
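As a sketch of that second approach, here's a Watcher watch that fires when fewer than the expected number of distinct Filebeat instances have reported monitoring stats recently. The watch ID, expected count, and logging action are placeholders (in practice you'd wire up an email or webhook action), and this assumes your license includes Watcher.

```python
# Sketch of a simple threshold alert via Watcher: every 5 minutes, count the
# distinct Filebeat instances that reported monitoring stats in the last 10
# minutes and alert when the count drops below the expected fleet size.
import requests

ES_URL = "https://my-cluster.example.com:9243"   # hypothetical cluster endpoint
AUTH = ("elastic", "changeme")                    # hypothetical credentials
EXPECTED_BEATS = 300

watch = {
    "trigger": {"schedule": {"interval": "5m"}},
    "input": {
        "search": {
            "request": {
                "indices": [".monitoring-beats-*"],
                "body": {
                    "size": 0,
                    "query": {"range": {"timestamp": {"gte": "now-10m"}}},
                    "aggs": {
                        # distinct Filebeat instances seen in the window
                        "active_beats": {
                            "cardinality": {"field": "beats_stats.beat.uuid"}
                        }
                    },
                },
            }
        }
    },
    "condition": {
        "compare": {
            "ctx.payload.aggregations.active_beats.value": {"lt": EXPECTED_BEATS}
        }
    },
    "actions": {
        # placeholder action: swap in email/webhook for real notifications
        "log_missing_beats": {
            "logging": {
                "text": "Fewer than {{ctx.metadata.expected}} Filebeat instances "
                        "reported in the last 10 minutes"
            }
        }
    },
    "metadata": {"expected": EXPECTED_BEATS},
}

requests.put(f"{ES_URL}/_watcher/watch/filebeat-fleet-check", json=watch, auth=AUTH)
```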

Thanks, I found the fields in the beats_stats structure that are used in the monitoring dashboard.
