My task is to introduce log analysis to an environment where my collector has to be a drop-in replacement for syslog, and I do not have the option of asking clients to use a different shipper.
In this environment, logs are forwarded to my machine via syslog from a number of different systems running a variety of applications. I listen to the syslog stream with Logstash and send the events to Elasticsearch.
I need to match each line against any one of a number of log format patterns and, once it is matched, parse out the appropriate fields and set metadata identifying the type.
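In other words, something like this, which I offer only as a minimal sketch (the patterns, types, and field names are hypothetical, not my actual formats):

    filter {
      # Try the first format; add_field only fires when the pattern matches.
      grok {
        match     => { "message" => "%{TIMESTAMP_ISO8601:app_ts} \[%{LOGLEVEL:level}\] %{GREEDYDATA:app_msg}" }
        add_field => { "[@metadata][log_type]" => "app_a" }
      }
      # Fall through to the next format if the first one did not match.
      if "_grokparsefailure" in [tags] {
        mutate { remove_tag => ["_grokparsefailure"] }
        grok {
          match     => { "message" => "%{COMMONAPACHELOG}" }
          add_field => { "[@metadata][log_type]" => "apache_access" }
        }
      }
    }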
I also need to recognize and deal with different types of multiline messages, appending only the syslog-message portion to the previous syslog-message rather than the entire line, syslog preamble included. This appears to be very tricky, if it is possible at all, probably requiring a combination of multiline, match, and mutate. My search for information has been complicated by the evolution of the multiline filter into a codec, which makes it hard to tell whether any given piece of advice is still relevant.
As this use case seems to me to be a nut that a few people have likely cracked, I'm looking for pointers and links that will set me on the right path.
I would start by finding out whether you can use a syslog input. Syslog is many things to many people; the syslog input expects RFC 3164 messages. Otherwise you might use a tcp input. (There was also a recent post suggesting using rsyslog to talk to all those syslog daemons and forward to Logstash.)
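For example (the ports here are illustrative):

    input {
      syslog { port => 5514 }    # parses the RFC 3164 preamble for you
      # tcp  { port => 5515 }    # raw lines; you parse the preamble yourself, e.g. with grok
    }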
If you need to selectively apply a multiline codec you may want multiple pipelines: have your main pipeline figure out which multiline treatment to apply (based on host, a pattern match, or whatever), then use tcp output/input pairs to route events to pipelines that apply the appropriate multiline codec, as sketched below. If you cannot get multiline to work it may be possible to use aggregate instead, but any pipeline running aggregate is restricted to a single pipeline worker thread (also, you want pipeline.java_execution set to false in logstash.yml until this bug is fixed).
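A minimal sketch of that arrangement; the pipeline ids, paths, ports, and routing condition are all hypothetical. The distributor forwards only the message field as plain lines, so the downstream multiline codec stitches together message portions rather than whole lines with their syslog preambles:

    # --- pipelines.yml: one distributor plus one pipeline per multiline treatment ---
    - pipeline.id: distributor
      path.config: "/etc/logstash/distributor.conf"
    - pipeline.id: app-a-multiline
      path.config: "/etc/logstash/app-a.conf"
      pipeline.workers: 1            # only required if this pipeline uses aggregate

    # --- distributor.conf: decide which treatment an event needs ---
    output {
      if [host] == "app-a-server" {                # hypothetical routing condition
        # Forward only the message text; other fields are dropped in this sketch.
        tcp { host => "127.0.0.1" port => 9901 codec => line { format => "%{message}" } }
      } else {
        elasticsearch { hosts => ["localhost:9200"] }
      }
    }

    # --- app-a.conf: apply the multiline treatment for this feed ---
    input {
      tcp {
        port  => 9901
        codec => multiline {
          pattern => "^\s"           # hypothetical: continuation lines start with whitespace
          what    => "previous"
        }
      }
    }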
On a past project where I took on several new data feeds at the same time, I found it helpful to tag data once I thought it was being parsed correctly and feed it to a different index. All the stuff that had not been parsed properly went into a 'fixme' index.
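Concretely, that routing can live in the output section (a minimal sketch; the tag and index names are hypothetical):

    output {
      if "parsed" in [tags] {
        elasticsearch { hosts => ["localhost:9200"] index => "logs-%{+YYYY.MM.dd}" }
      } else {
        # Anything not yet parsed correctly lands here for later inspection.
        elasticsearch { hosts => ["localhost:9200"] index => "fixme-%{+YYYY.MM.dd}" }
      }
    }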
Excellent - using rsyslog with its JSON output to Logstash solves a couple of issues (see the input sketch after this list):
- listening on privileged port 514
- parsing RFC syslog fields without reinventing the wheel
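On the Logstash side that boils down to something like this (a sketch; the port is an assumption on my part):

    input {
      tcp {
        port  => 10514           # unprivileged, so Logstash need not run as root
        codec => json_lines      # one JSON document per line, as emitted by an rsyslog JSON template
      }
    }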
I'll get some practice with multiple pipelines.
Thanks for the tip re the fixme index - a good practice.
I wonder if anyone can recommend a development workflow. I find myself with multiple sessions open:
- one to edit the Logstash config, then HUP the process to reload it
- another to run logger -f example-data.log to send the data
- a Kibana browser window to see the results
I suspect I should skip the Kibana step by outputting directly to the console with a stdout { codec => rubydebug } output, but maybe there are existing tools and scripts that make the Logstash SDLC a bit less wonky?
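For the moment, the bare-bones harness I'm picturing looks like this (file names are hypothetical):

    # test.conf: read events from stdin, dump parsed events to the console
    input  { stdin {} }
    output { stdout { codec => rubydebug } }

Run with bin/logstash -f test.conf < example-data.log, or add --config.reload.automatic to avoid the HUPs while editing.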