We are currently doing a trial of Elastic Cloud and we're using Filebeat to gather logs from a few hosts. At the moment we're taking some simple performance measurements and (generally) things look positive.
In our production scenario, we'll have about 300 hosts, each running Filebeat (tailing 1 to 3 log files each) and sending data to Elasticsearch.
Suppose one or more of these 300 Filebeat instances stops sending data, encounters output errors, or otherwise misbehaves. Ideally, we'd have alerts set up to identify such problems almost immediately. At a minimum, we'd like some visualizations and dashboards that start from aggregated data and let us sift through it in a minute or two.
I don't see a pre-configured way to, for example, look at aggregate output errors across all (or a subset) of our Filebeat instances, or to investigate idle or failed instances that aren't publishing events at all.
Has anyone else dealt with this challenge? If so, how did you attack it? My naive instinct is to build some custom visualizations by looking at the .monitoring-beats indices, examining the fields reported there, and writing some queries accordingly.
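To make that instinct concrete, here's a rough sketch of the two query bodies I had in mind, one to find instances that have gone quiet and one to surface output errors per instance. The field names (`beats_stats.beat.uuid`, `beats_stats.metrics.libbeat.output.write.errors`, `timestamp`) are my reading of the monitoring document structure and may differ by stack version, so treat this as a starting point, not a verified recipe:

```python
# Sketch: query bodies for the .monitoring-beats-* indices to spot
# misbehaving Filebeat instances. Field names are assumptions based on
# the beats_stats documents I've seen; verify against your own data.

def last_seen_per_beat_query(lookback: str = "1d") -> dict:
    """Bucket monitoring docs by Filebeat instance and take the most
    recent report time per instance. Any bucket whose last_seen value
    is older than a few reporting intervals is likely idle or dead."""
    return {
        "size": 0,
        "query": {"range": {"timestamp": {"gte": f"now-{lookback}"}}},
        "aggs": {
            "per_beat": {
                "terms": {"field": "beats_stats.beat.uuid", "size": 500},
                "aggs": {
                    "last_seen": {"max": {"field": "timestamp"}},
                },
            }
        },
    }

def output_errors_per_beat_query(window: str = "10m") -> dict:
    """Per-instance view of output write errors in a recent window.
    The libbeat counters are cumulative, so the max within the window
    gives the latest reading; comparing windows shows the delta."""
    return {
        "size": 0,
        "query": {"range": {"timestamp": {"gte": f"now-{window}"}}},
        "aggs": {
            "per_beat": {
                "terms": {"field": "beats_stats.beat.uuid", "size": 500},
                "aggs": {
                    "write_errors": {
                        "max": {
                            "field": "beats_stats.metrics.libbeat.output.write.errors"
                        }
                    },
                },
            }
        },
    }
```

These bodies can be pasted into Kibana Dev Tools against `.monitoring-beats-*/_search`, or passed to a client library; the same aggregations would also back a Lens/TSVB visualization split by beat UUID.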