Here's an example result that I pulled from one of our dev env servers:
What we're after is detecting situations where Filebeat is unable to send data to Logstash, or where data isn't sent for some other reason. I guess the 'pipeline.events' object contains the data that I'm after, but I don't know what each of the metrics actually means.
Is anyone able to help me figure out what we should look for? The plan is to write a simple Datadog agent plugin that exposes the necessary metrics, and set up alerting from there.
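For the plugin side, here is a rough sketch of how the nested stats JSON could be turned into flat metric names that a monitoring agent can report. The URL is an assumption (Filebeat started with its HTTP endpoint enabled on `localhost:5066`), and `fetch_stats` and `flatten` are made-up helper names for illustration, not part of any Beats or Datadog API.

```python
import json
from urllib.request import urlopen

# Assumption: address of the Filebeat HTTP stats endpoint; adjust to your setup.
STATS_URL = "http://localhost:5066/stats"


def fetch_stats(url=STATS_URL, timeout=5):
    """Fetch the raw stats document and return it as a nested dict."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def flatten(node, prefix=""):
    """Flatten nested dicts into {"a.b.c": value}, keeping only numeric leaves."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            flat[name] = value
    return flat
```

Calling `flatten(fetch_stats())` on each check run would give one gauge per dotted name, so the agent plugin does not need to know the structure of the response up front.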
@trondhindenes Did you take a look at monitoring Beats with x-pack? Beats monitoring comes with x-pack basic, which sends all these stats to Elasticsearch. The stats are stored in indices, so you can query them.
We plan to add alerting in future versions: we will throw an alert if we didn't receive stats for a period of time, and apply heuristics to the event rates. But that could also be done externally.
We don't use the paid version of x-pack so alerting isn't available to us. We use Elasticsearch solely for log ingestion and other tools for infrastructure monitoring.
TBH, the filebeat rest endpoint seems to do more than enough of what we need, I just need to figure out what the exposed metrics actually mean. I realize the stats endpoint is in a pre-release state so I guess "formal" documentation isn't in place yet, but looking at the similar Logstash monitoring api those metrics never got properly documented either, so that's why I'm asking.
Edit: I see now that x-pack basic includes something called "full-stack monitoring" and I assume filebeat monitoring is part of that. Tho it looks like one needs the paid version to get alerting (I honestly don't know if I'd call monitoring without alerting "monitoring"), but anyways: We'd still like to tap into the stats api directly from our "regular" monitoring tooling.
I'm glad the data is useful for you. One of the reasons we haven't documented the fields yet is that the structure of the events will change slightly.
For the naming, we "try" to make it as self-explanatory as possible, which is not always easy. If there are 3-4 metrics you are interested in, I'm happy to explain them here.
For the metrics you asked about above: we've had quite a few discussions about this in the past. It's a tricky issue to track, as several metrics are influenced by it, and we need to improve here. I thought there was also a GitHub issue on it, but I couldn't find it.
Ultimately we just want to answer "can we send stuff to Logstash?" I guess the way to answer that is to look at the size of Filebeat's "queued" messages and the frequency of errors from trying to send. Are these somehow exposed, directly or indirectly? It's not clear to me from the JSON response which counter(s) I should be monitoring.
I think it's important to mention that Filebeat can overload Logstash and will automatically back off. This will increase the queue size, but I'm not sure if we have all the metrics there yet. At least in your JSON, not enough queue stats showed up.
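Until proper queue stats are exposed, one indirect way to approach the "can we send stuff to Logstash?" question is to compare counters between two polls. Below is a minimal sketch, assuming each snapshot is a flat dict keyed by dotted metric name. The three metric names are assumptions about what the stats response contains and should be checked against your own output; `counter_delta` and `summarize` are hypothetical helpers.

```python
# Assumed counter names; verify them against your own stats response.
PUBLISHED = "libbeat.pipeline.events.published"
ACKED = "libbeat.output.events.acked"
FAILED = "libbeat.output.events.failed"


def counter_delta(prev, curr, name):
    """Increase of a counter between two snapshots.

    If the value dropped, the Beat probably restarted and the counter
    reset, so the current value is taken as the increase.
    """
    before = prev.get(name, 0)
    after = curr.get(name, 0)
    return after if after < before else after - before


def summarize(prev, curr):
    """Per-interval deltas plus a flag for 'events came in, none were acked'."""
    published = counter_delta(prev, curr, PUBLISHED)
    acked = counter_delta(prev, curr, ACKED)
    failed = counter_delta(prev, curr, FAILED)
    return {
        "published": published,
        "acked": acked,
        "failed": failed,
        "stuck": published > 0 and acked == 0,
    }
```

A single "stuck" interval can simply be the backoff mentioned above, so alerting on several consecutive stuck intervals, or on a steadily growing gap between published and acked, is probably more robust than alerting on one poll.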