I'm looking at developing this: wrapping a log parsing library to integrate into Filebeat.
The challenge is matching up start/end log entries, together with various intervening entries for some commands (the log format is a little idiosyncratic).
Optionally it will also handle Perforce server structured logging (a much more regular format, although still with start/end entries requiring matching), but that's down the track.
I did consider creating a custom Beat, but it seems a custom module will make it easier to hook into the logic for reading log files appropriately.
So I'm planning a shared library which will parse a stream of log lines and emit JSON entries as appropriate. This library will be called by the new module (as well as by existing standalone analysis utilities).
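The library's interface could be as simple as a generator that consumes lines and yields a JSON string for each completed entry. A toy sketch under invented assumptions (the `start`/`end` prefixes here stand in for the real, idiosyncratic Perforce format; nothing below is actual Beats or Perforce API):

```python
import json

def parse_stream(lines):
    """Consume log lines, yield one JSON string per completed entry.

    Toy line format for illustration only: 'start <pid> <cmd>' opens an
    entry and 'end <pid>' completes it; the real log format differs.
    """
    pending = {}  # pid -> partially built record
    for line in lines:
        parts = line.split()
        if parts and parts[0] == "start":
            pid = parts[1]
            pending[pid] = {"pid": pid, "cmd": " ".join(parts[2:])}
        elif parts and parts[0] == "end":
            rec = pending.pop(parts[1], None)
            if rec is not None:
                yield json.dumps(rec)
```

Keeping the library a pure line-stream-to-JSON transformer like this means the same code can back both the Filebeat module and the standalone analysis utilities.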
One challenge is that the module will maintain a list of current entries for which it has a start but not yet an end record. This list needs to be saved somewhere when the service is stopped, and then read again on startup when it resumes log processing. I'm not yet sure of the best way to do such save/restore of state.
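One simple approach to the save/restore problem is to serialize the pending-entry table to disk on shutdown and reload it on startup, much as Filebeat's own registry persists file offsets. A minimal sketch (file name and record fields are hypothetical; note that JSON object keys are strings, so pids are stored as strings):

```python
import json
import os

def save_state(path, pending):
    """Write the pending start-records (keyed by pid string) to disk.

    Write to a temp file then rename, so a crash mid-write never
    leaves a half-written state file behind.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(pending, f)
    os.replace(tmp, path)  # atomic rename on POSIX

def load_state(path):
    """Reload pending start-records on startup; no file means a fresh start."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)
```

On restart, any entries reloaded here simply go back into the pending table before the first new log line is read.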
Filebeat could read log lines for you and aggregate multiline messages into single events. The messages could then be forwarded to Elasticsearch, which does the parsing for you. The progress of sending/reading logs is tracked by Filebeat, so there is no data duplication in the output.
Rather like Auditbeat, I would want the module to match up start and end records, extract other information where present (e.g. the --- lines above), and return the fields, then provide single cleaned-up records to Elasticsearch.
There is usually a completed record with the same pid (and other fields) which denotes the end of a record. But in some cases extra information is attached (known as track info) which records db lock info etc. The parsing is a bit tricky, but it's a well-understood problem. The question is where in the pipeline it is best to do the matching of start/end records; this is where I assume a custom module is a good fit.
The only downside I can see is that a custom module is then included in Filebeat as a whole; it is less standalone than a custom Beat.
You could aggregate log lines into multiline events. I looked at the example logs and it seems the last lines of every multiline event start with ---. Is that correct?
If yes, you could configure a multiline pattern so these lines are aggregated into the same event by Filebeat. Then you could do the processing in ES using the processors of the Ingest node: https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest-processors.html
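Assuming the continuation lines really do all start with ---, the multiline settings would look roughly like this (paths and pattern are assumptions to adapt to the actual logs):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/perforce/log    # hypothetical log location
    # Lines matching the pattern (negate: false) are appended (match: after)
    # to the preceding non-matching line, folding the --- track lines into
    # the event for the command record above them.
    multiline.pattern: '^---'
    multiline.negate: false
    multiline.match: after
```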
If not, Filebeat does not support multiple multiline configurations, so you might need to solve the line aggregation differently. But if I were you and it was possible, I would stick with a Filebeat module, because it requires less development and so can be done more quickly.
Or is it a strict requirement to do preprocessing before sending the event to ES?
I'm not worried about the log parsing, but I am looking for ways to avoid re-inventing the wheel. So in a custom Beat, it would be great to be able to build on top of the Harvester and similar libraries, e.g. to take advantage of things like storing the offset within a file, detecting renames, etc. But I am not sure I can sensibly use just those bits of functionality.