I love the discussion this is generating. Let me throw another idea onto the pile here.
A few months ago, one of my employees pointed out that whenever logstash-forwarder is unable to talk to Logstash, unlike other systems (e.g. the syslog variants), it simply does nothing: it stops processing the logs. It doesn't need to buffer anything, because the buffer is the log file itself. And if Logstash is down for too long, well, the log gets rotated and tough luck, you've lost some logs. I thought that was genius! It fits the Unix philosophy very well: logstash-forwarder is great at reading and shipping log files, and disk space is handled by logrotate.
Compare that to rsyslog, which I've had blow up on me after losing its connection to the syslog server (yes, over TCP). It tried to queue up the messages and, either because of a bug or a misconfiguration, ended up pegging the CPU at 100% until it was "dealt with".
So back to ELK and metrics. I run a DevTools team, and each of our company's product teams runs its own infrastructure. One of the services my team offers other teams is a hosted ELK stack. They just need to install logstash-forwarder on their servers and point it to our centralized system.
Our next big project is to offer the same kind of turnkey solution around gathering metrics. I'm a big fan of Etsy's philosophy of making metrics as easy as possible to gather.
Now that those teams have logstash-forwarder installed, I already have:
- a secure transport from all of the infrastructure to our analytics system (a plain ELK for now)
- a good buffering mechanism for the inevitable planned or unplanned downtimes of ELK: the log files themselves.
- a buffering mechanism that's trivial to understand and configure by any operator: logrotate.
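To make that last point concrete, the whole "buffering policy" could look like an ordinary logrotate stanza. This is a sketch only, with a hypothetical path and made-up retention values, not something we actually ship:

```
# /etc/logrotate.d/metrics -- illustrative only
/var/log/metrics/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

Here `copytruncate` lets whatever is appending to the file keep its handle open; `rotate 7` is the entire "how much buffer do I get" knob, which is exactly the kind of thing any operator already understands.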
So my idea for the metrics project was simply to tell my tenants to drop all of their metrics into one or more metrics logs, give them guidance on how to properly format them (e.g. the InfluxDB line format), and have them tag those files in logstash-forwarder so I know to process them as metrics.
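For the curious, the tagging side could be a regular logstash-forwarder config stanza; the server address, certificate path, and field values below are made up for illustration:

```json
{
  "network": {
    "servers": [ "elk.example.com:5043" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt"
  },
  "files": [
    {
      "paths": [ "/var/log/metrics/*.log" ],
      "fields": { "type": "metrics" }
    }
  ]
}
```

On the Logstash side, a simple conditional on `[type] == "metrics"` would then route those events to the metrics pipeline instead of the log pipeline. The log lines themselves would be InfluxDB line protocol, along the lines of `cpu_load,host=web01 value=0.64 1434055562000000000`.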
This would give everyone a very simple way to feed us metrics. Heck, the simplest bash script can now feed us metrics by appending to a text log. And if performance / minimizing in-request IO is crucial, they can buffer the writes or make them asynchronous (e.g. local syslog from socket to file).
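As a sketch of just how simple the producer side gets, here is a shell one-liner-grade script that emits one disk-usage metric in InfluxDB line protocol by appending to a local metrics log. The log path, measurement name, and field name are all illustrative, not an actual convention of ours (in production the log would live somewhere like `/var/log/metrics/`):

```shell
#!/bin/sh
# Append one disk-usage metric in InfluxDB line protocol to a metrics log.
# METRICS_LOG, "disk_used", and "percent" are hypothetical names.
METRICS_LOG="${METRICS_LOG:-./custom-metrics.log}"

# Percentage of the root filesystem in use, e.g. "42".
used_pct=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

# Line protocol: measurement,tag=value field=value [timestamp in ns]
echo "disk_used,host=$(hostname) percent=${used_pct} $(date +%s)000000000" >> "$METRICS_LOG"
```

Run it from cron every minute and you have a metrics agent with zero daemons, buffered by the filesystem and shipped by the forwarder you already installed.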
What this "new" system gives me is still more of the same:
- uses the same secure transport (no need for e.g. Sensu that talks to RabbitMQ)
- same buffering mechanism: any good sysadmin will already logrotate any new logs they generate on a server; I don't need to tell them how to do that.
The only missing piece would then be that "metrics scraper", plus a pile of ready-made scripts to help my tenants monitor the basics and the systems hosted on each server.
So the solution I initially had in mind wasn't to build a scraper that talks lumberjack. It was to build a scraper that saves to one or more files, just like I'll tell people to do for their custom / ad-hoc metrics.
I'm not sure how that fits with the Beats philosophy here. I haven't had time to try Packetbeat yet and see how it works. But perhaps reusing that idea (a file as a buffer) could be an MVP for new beats? Or one of the output options (besides direct Elasticsearch feeding, and soon the lumberjack protocol)?
Edited: I replaced "caching" with "buffering", which more accurately represents what I mean.