I'm implementing an ELK stack for indexing and analysing Java logs. At the moment it's only a proof of concept, and I cannot feed Logstash with live plain-text log files; what I have are gzipped historical log files, and I want to index their content.
Given that I cannot combine two codecs (gzip_lines and multiline) on the same input, what is the best way to index them? Should I aggregate the lines myself and feed Logstash with the result, or is there another way to reach my goal?
I've written a Python script that reads lines from the gzipped files and feeds Logstash via the http input plugin, but I suspect it is not the best solution, judging by the long time it takes to index the files.
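For reference, a minimal sketch of that approach (the host and port of the http input are placeholders); posting one HTTP request per line is probably where most of the time goes:

```python
import gzip
import json
import urllib.request

# Placeholder endpoint: a Logstash pipeline with an http input listening here.
LOGSTASH_URL = "http://localhost:8080"

def push_gzip_log(path):
    """Read a gzipped log file line by line and POST each line to Logstash."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            payload = json.dumps({"message": line}).encode("utf-8")
            req = urllib.request.Request(
                LOGSTASH_URL,
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req).close()

if __name__ == "__main__":
    push_gzip_log("app.log.gz")
```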
I suggest that you use your Python script to unzip the files into another folder, and have Filebeat read the unzipped files and send them to the beats input. Filebeat has support for multiline.
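For instance, a minimal filebeat.yml sketch (the paths, the multiline pattern, and the Logstash host are placeholders to adapt, and the exact top-level keys vary a little between Filebeat versions):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/unzipped/*.log              # folder your Python script unzips into
    multiline.pattern: '^\d{4}-\d{2}-\d{2}'  # assume a new entry starts with a date
    multiline.negate: true                   # lines NOT matching the pattern...
    multiline.match: after                   # ...are appended to the previous entry

output.logstash:
  hosts: ["localhost:5044"]                  # the Logstash beats input
```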
Meanwhile I've implemented a Java program that extracts log entries (handling multiline entries as well) and pushes them into a Redis list. Logstash then takes those entries from Redis and indexes them into Elasticsearch.
It works, but I haven't measured performance yet.
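In Python terms, the program does roughly the following (the real implementation is in Java; the Redis host, the list key, and the timestamp pattern are assumptions to adapt):

```python
import gzip
import json
import re

import redis  # third-party client: pip install redis

# Assumptions: Redis on localhost:6379 and a list key named "java_logs"
# that the Logstash redis input reads from.
REDIS_KEY = "java_logs"
# A new entry is assumed to start with a timestamp such as
# "2016-05-17 10:15:32,123 INFO ..."; anything else (stack traces, wrapped
# lines) is treated as a continuation of the previous entry.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def push_entries(path, conn):
    """Aggregate multiline entries from a gzipped log and RPUSH them to Redis."""
    buffer = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if ENTRY_START.match(line) and buffer:
                conn.rpush(REDIS_KEY, json.dumps({"message": "\n".join(buffer)}))
                buffer = []
            buffer.append(line)
    if buffer:  # flush the last entry of the file
        conn.rpush(REDIS_KEY, json.dumps({"message": "\n".join(buffer)}))

if __name__ == "__main__":
    push_entries("app.log.gz", redis.Redis(host="localhost", port=6379))
```

On the Logstash side, a redis input with data_type => "list" and the same key picks the entries up.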
In this manner I know when all lines of the gzipped file have been processed. In the solution you suggested, is there a way to know when a file has been fully processed? The goal is to remove the uncompressed file as soon as its content has been loaded somewhere and is ready to be indexed.
If you need more performance you can push alternate files' contents to two Redis instances, and use two Logstash instances, each reading from one Redis and outputting to the same Elasticsearch index.
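A sketch of the producer side of that idea (the two Redis ports and the list key are placeholders); each Logstash instance then only needs a redis input pointing at its own Redis and an elasticsearch output with the same index name:

```python
import itertools

import redis  # third-party client: pip install redis

# Placeholder setup: two local Redis instances on different ports.
REDIS_KEY = "java_logs"
instances = itertools.cycle([
    redis.Redis(host="localhost", port=6379),
    redis.Redis(host="localhost", port=6380),
])

def push_file(entries):
    """Send all entries of one file to a single Redis instance, alternating per file."""
    conn = next(instances)
    for entry in entries:
        conn.rpush(REDIS_KEY, entry)
```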