Metrics Beat

More general than Graphite Beat, I would suggest a beat that can generate metrics and send them to Logstash (which can in turn send them to Graphite / Influx / OpenTSDB / other).

I love the nagios / sensu model of simply having a script scheduler that knows how often a specific metrics scraper is supposed to run (60s, hourly, etc) and then the actual script is left to the operator to implement, or to get one from the community (e.g. sensu-community-plugins, in particular the "metrics" plugins).

A runner that executes other scripts is an extremely flexible approach that lets anyone in the community get their metrics in the way that makes the most sense:

  • Query for the metric in different ways (read filesystem, query via HTTP or a custom protocol, run a command, etc)
  • Parse various forms of text output
  • The operator can use the language that suits them best, if what's available doesn't make sense for their situation

Of course, that doesn't prevent offering such pre-built scripts or even mechanisms other than running external scripts.
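
To make the "script scheduler" idea concrete, here is a minimal sketch of the scheduling half only. The names (`Check`, `due_checks`) are hypothetical and not taken from Nagios, Sensu, or Beats; the sketch just assumes each check carries a command and an interval in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    """One scheduled metrics script (hypothetical structure)."""
    name: str
    command: List[str]      # argv of the operator-supplied script
    interval: float         # seconds between runs, e.g. 60 or 3600
    next_run: float = 0.0   # time at which the check is next due


def due_checks(checks, now):
    """Return the checks that are due at `now` and reschedule them."""
    due = []
    for check in checks:
        if now >= check.next_run:
            due.append(check)
            check.next_run = now + check.interval
    return due
```

A runner would call `due_checks(checks, time.monotonic())` in a loop and hand each `command` to whatever executes the script; what the script does and which language it is written in stays entirely up to the operator.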

The Sensu community took the unfortunate turn of using the Graphite line protocol directly. Since Graphite doesn't support meta-data, this means you have to encode said meta-data in your metric name. E.g. mysql.clustername.master.connections X TS, which can be hard to extend later without breaking a lot of existing dashboards.

I would propose that Metrics Beats use a more general format, like InfluxDB's or OpenTSDB's. Examples:
Influx: mysql.connections X TS {"cluster": "clustername", "role": "master" }
OpenTSDB: mysql.connections X TS cluster=clustername role=master

Both of which can be turned into Graphite format, if need be.
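
As a rough illustration of that conversion, the sketch below folds tag values from the OpenTSDB-style line above into a Graphite-style name. The function name and the `tag_order` parameter are my own assumptions; where the tag values land in the name is a convention you would have to pick, not something Graphite defines.

```python
def tagged_to_graphite(line, tag_order):
    """Fold tag values into the metric name, Graphite-style.

    `line` is "<name> <value> <timestamp> key=value ...", as in the
    OpenTSDB-style example above. `tag_order` says which tags to append
    to the name and in what order (a hypothetical convention).
    """
    name, value, timestamp, *pairs = line.split()
    tags = dict(pair.split("=", 1) for pair in pairs)
    parts = [name] + [tags[key] for key in tag_order if key in tags]
    return "{} {} {}".format(".".join(parts), value, timestamp)
```

Going the other way (recovering tags from a Graphite name) needs per-metric knowledge of what each path segment means, which is exactly why starting from the tagged format is easier.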

I hope this didn't turn into too much of a rant. But basically, I was on the verge of getting my team to start writing this tool this week or next, sending via logstash-forwarder*. So we'll be looking very closely at what you're building, and we'll try to contribute to and use Beats, if possible.

* Logstash-forwarder/lumberjack is already a secure transport from all of our infrastructure to our ELK cluster. I don't want to have metrics come in via a second, different way that also needs to be secured & so on. This approach has been heavily influenced by the ideas presented in this foundational article, by one of the Kafka creators: The Log: What every software engineer should know about real-time data's unifying abstraction


Hi, thanks for putting this in writing, I think it makes a lot of sense. You are right, getting all the operational data over the Lumberjack protocol into the Logstash pipeline is pretty much the idea for Beats. I'm glad that resonates.

I also agree that an approach similar to sensu/nagios, one that simply uses stdout, would be the lowest friction for adoption.

I started on a SocketBeat to accept TCP connections and send the lines as events. It's super rough. This was so a local TCollector agent and a few apps with local OpenTSDB reporters can all write to localhost and have the data buffered and written to our ELK pipeline. The udp_bridge and tcp_bridge collectors I wrote for TCollector aren't very good.

Having a Prospector kick off Metric Harvesters (like the file harvesters in logstash-forwarder) that will execute scripts (like TCollector collectors) would make a lot of sense in my environment.

Maybe I'll switch gears on SocketBeat to go that route. Would that be helpful in your use case?

Currently I have very rough code which accepts TCP connections and ships off the lines. It assumes the input is lines intended for OpenTSDB, i.e. the output of a TCollector collector.

The next step is to implement something similar to the Prospector/Harvester model that logstash-forwarder uses. A prospector will execute commands as configured at various time intervals (like TCollector does). The harvester will contain the running script, log errors from its stderr, and pass lines from its stdout to the event channel.
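
Here is a minimal sketch of the harvester half of that idea, to show the stdout/stderr split. It is not the SocketBeat code: `harvest` and the event field names are hypothetical, and it waits for the collector to exit instead of streaming lines from a long-running one the way TCollector does.

```python
import subprocess
import sys


def harvest(command, timeout=30):
    """Run one collector command and return (events, errors).

    Each non-empty stdout line becomes an event; stderr lines are
    returned separately so the caller can log them.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    events = [
        {"message": line, "source": command[0]}
        for line in result.stdout.splitlines()
        if line.strip()
    ]
    errors = result.stderr.splitlines()
    return events, errors


if __name__ == "__main__":
    # Stand-in collector: a one-line Python script printing one metric.
    demo = [sys.executable, "-c", "print('put sys.load 1432876501 0.5 host=a')"]
    events, errors = harvest(demo)
    for event in events:
        print(event["message"])
```

A prospector would then just be the piece that decides when to call `harvest` for each configured command and pushes the returned events onto the event channel.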

From there I would want to move the parsing logic to a filter and make it configurable to read Graphite, OpenTSDB and other formats and send standardized events with "message" being the original line and fields added as appropriate based on what format it is parsing.

Here is one OpenTSDB line parsed with the current code:

{
  count: 1,
  line: 2,
  message: "put kafka.producer.ProducerRequestMetrics.ProducerRequestRateAndTimeMs.98percentile 1432876501 0.11 serverid=ps515 server1010.dc.example.com=brokerPort brokerHost=server1010.dc3.example.com domain=dc3 host=server1001 machine_class=server",
  metric_name: "kafka.producer.ProducerRequestMetrics.ProducerRequestRateAndTimeMs.98percentile",
  metric_tags: "serverid=ps515 server1010_dc3_example_com=brokerPort brokerHost=server1010_dc3_example_com domain=dc3 host=server1010 machine_class=server",
  metric_tags_map: {
    server1010_dc3_example_com: "brokerPort",
    brokerHost: "server1010_dc3_example_com",
    domain: "dc3",
    host: "server1010",
    machine_class: "server",
    serverid: "ps515"
  },
  metric_timestamp: "1432876501",
  metric_value: "0.11",
  offset: 213,
  shipper: "C02MR0K3FD58",
  source: "127.0.0.1:59735",
  timestamp: "2015-05-29T05:15:02.427Z",
  type: "tcollector"
}
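
For anyone who wants to play with the format, this is roughly what the parsing step looks like as a standalone sketch. It is not the actual SocketBeat code; `parse_put_line` is a hypothetical name, the field names simply mirror the event above, and unlike that event the sketch leaves dots in tag keys and values untouched.

```python
def parse_put_line(line):
    """Parse 'put <metric> <timestamp> <value> <k=v> ...' into an event dict."""
    parts = line.split()
    if len(parts) < 4 or parts[0] != "put":
        raise ValueError("not an OpenTSDB put line: %r" % line)
    _, name, timestamp, value, *pairs = parts

    tags = {}
    for pair in pairs:
        key, sep, val = pair.partition("=")
        if not sep or not key:
            raise ValueError("malformed tag %r in line: %r" % (pair, line))
        tags[key] = val

    return {
        "message": line,               # keep the original line as-is
        "metric_name": name,
        "metric_timestamp": timestamp,
        "metric_value": value,
        "metric_tags_map": tags,
    }
```

Keeping the original line in `message` means a later filter can always re-parse it if the field layout changes.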

It's a great start for another Beat, much faster than I expected :). Our thoughts on metrics beats are still rough, but I would hope they would not need 2 processes to run (TCollector and another beat). Maybe we can fold the metric collection itself into the metrics beat, so there is a single shipper.

But, I love the idea of a beat that can already utilize existing collectors for now.

I don't think we'd need two processes.

My intention would be that this beat would replace TCollector. It's a SocketBeat now, because that is what I needed for what I was doing yesterday. I'm going to work on cleaning it up and then I'll convert to actually run the collectors the same way TCollector does.

I'll move the line parsing to a filter plugin, so that it can parse output from collectors that emit the OpenTSDB format, and then other formats as filter plugins are written.

I would love to join in on any conversations about metrics beat.

Being able to shell out to run collectors is a very valuable extension method. There is no reason some core collectors, like procstats, netstat, etc, couldn't be directly coded into MetricsBeat. Being able to specify a command to execute will resonate well with TCollector and Sensu users and provide a great deal of flexibility.

I love the discussion this is generating. Let me throw another idea onto the pile here.

A few months ago, one of my employees pointed out the fact that whenever logstash-forwarder is unable to talk to Logstash, contrary to other systems (e.g. the syslog variants), it just does nothing. It stops processing the logs. It doesn't need to buffer anything: the buffer is the log file itself. And if Logstash is down for too long, well the log gets rotated and tough luck, you've lost some logs. I thought that was genius! It fits very well with the Unix philosophy: logstash-forwarder is great at reading and shipping log files, and the disk space is handled by logrotate.

Compare that to rsyslog, which I've had blow up on me when it lost its connection to the syslog server (yes, TCP). It tried to queue up the messages and, either because of a bug or a misconfiguration, ended up taking 100% CPU until it was "dealt with".

So back to ELK and metrics. I run a DevTools team, and each of our company's product teams runs its own infrastructure. One of the services my team offers other teams is a hosted ELK stack. They just need to install logstash-forwarder on their servers and point it to our centralized system.

Our next big project is to offer the same kind of turnkey solution around gathering metrics. I'm a big fan of Etsy's philosophy of making metrics as easy as possible to gather.

Now that my teams have logstash-forwarder installed I already have:

  • a secure transport from all of the infrastructure to our analytics system (a plain ELK for now)
  • a good buffering mechanism for the inevitable planned or unplanned downtimes of ELK: the log files themselves.
  • a buffering mechanism that's trivial to understand and configure by any operator: logrotate.
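
To spell out what I mean by "the log file is the buffer", here is a minimal sketch of the idea. It is not how logstash-forwarder is implemented: `ship_new_lines` and the `send` callable are hypothetical, and log rotation is not handled at all.

```python
def ship_new_lines(log_path, offset, send):
    """Send complete lines written after `offset`; return the new offset.

    `send` is a hypothetical callable that returns True when the line
    was accepted downstream. On failure we simply stop: nothing is
    queued in memory, and the next call resumes from the same offset.
    """
    with open(log_path, "rb") as log:
        log.seek(offset)
        while True:
            line = log.readline()
            if not line.endswith(b"\n"):
                break  # end of file, or a line still being written
            text = line.rstrip(b"\n").decode("utf-8", "replace")
            if not send(text):
                break  # downstream unavailable: keep the old offset
            offset = log.tell()
    return offset
```

The only state to persist is the offset, and the only "queue" is the file on disk, which logrotate already knows how to bound.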

So my idea for the metrics project was simply to tell my tenants to drop all of their metrics in one or more metrics logs, and give them guidance on how to properly format them (e.g. InfluxDB line format) and tag them in logstash-forwarder so I know to process them as metrics.

This would give everyone a very simple way to feed us metrics. Heck, the simplest bash script can now feed us metrics by appending to a text log. And if performance / minimizing IO in-request is crucial, they can buffer the writes or make them async (e.g. local syslog from socket to file).
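
As a sketch of how little a tenant would need, the helper below appends one metric per line to a log file. The function names are hypothetical, and the line layout is only an assumption: it follows the OpenTSDB-style "name value timestamp key=value" example from earlier in the thread, not an agreed format.

```python
import time


def format_metric(name, value, tags, timestamp=None):
    """Build one 'name value timestamp key=value ...' line (assumed layout)."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    tag_text = " ".join("{}={}".format(key, tags[key]) for key in sorted(tags))
    return "{} {} {} {}".format(name, value, ts, tag_text).rstrip()


def append_metric(path, name, value, tags, timestamp=None):
    """Append one metric line to the metrics log at `path`."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(format_metric(name, value, tags, timestamp) + "\n")
```

Calling `append_metric("/var/log/myapp/metrics.log", "mysql.connections", 42, {"cluster": "clustername", "role": "master"})` (the path is made up) is then all an application has to do; shipping, buffering, and rotation are handled by the tools already on the box.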

What this "new" system gives me is still more of the same:

  • uses the same secure transport (no need for e.g. Sensu that talks to RabbitMQ)
  • same buffering mechanism: any good sysadmin will already logrotate new logs they generate on the server, I don't need to tell them how to do that.

The only missing piece would then have been that "metrics scraper", plus a pile of ready-made scripts to help my tenants monitor the basics and the systems hosted on each server.

So the solution I initially had in mind wasn't to build a scraper that talks lumberjack. It was to build a scraper that saves to one or more files, just like I'll tell people to do for their custom / ad-hoc metrics.

Not sure how that fits with the Beats philosophy here. I haven't had time to try Packetbeat and see how it works. But perhaps reusing that idea (file as a buffer) could be an MVP for new beats? Or one of the output options (besides direct Elasticsearch feeding, and soon the lumberjack protocol)?


Edited: I replaced "caching" by "buffering", which represents more accurately what I mean.

Hi, I just want to point to the Metrics 2.0 webpage by Dieter Plaetinck. He is proposing a better metrics format, one that provides self-describing metrics.
