I'd like to verify that I'm approaching this correctly. I am trying to set up a Logstash pipeline that first downloads some large (~500 MB) ZIP files over HTTP, then unzips them, then processes the data. I am having trouble with the first part, though.
Should I use http_poller for this?
I see a gzip codec. Should I be using this with .ZIP, or is that not going to work?
I get these logs when trying with http_poller:
Feb 16 20:50:28 lamp-01 systemd[1]: Started logstash.
Feb 16 20:50:42 lamp-01 logstash[12860]: Sending Logstash's logs to /var/log/logstash which is now configured via log4j2.properties
Feb 16 20:53:00 lamp-01 logstash[12860]: java.lang.OutOfMemoryError: Java heap space
Feb 16 20:53:00 lamp-01 logstash[12860]: Dumping heap to java_pid12860.hprof ...
Feb 16 20:53:00 lamp-01 logstash[12860]: Unable to create java_pid12860.hprof: Permission denied
Any idea why this is happening? These errors appear only a few minutes after starting the pipeline, and I allocated 2 GB of heap to Logstash.
If I read it correctly, a Zlib::GzipReader reads the whole object (500 MB) into memory and decompresses it. I am not surprised it blows up a 2 GB heap. To me it does not appear to be a stream reader.
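To make the distinction concrete, here is a minimal Python sketch of the two reading styles. It is not Logstash's Ruby code, and the function names are made up for illustration. The first function holds the compressed input and the full decompressed output in memory at once; the second only ever holds one chunk.

```python
import gzip
import io


def decompress_all(data: bytes) -> bytes:
    """Whole-object style: input and complete output both live in memory."""
    return gzip.decompress(data)


def iter_decompressed(fileobj, chunk_size: int = 64 * 1024):
    """Streaming style: yield decompressed chunks of at most chunk_size bytes."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as reader:
        while True:
            chunk = reader.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    payload = b"some log line\n" * 100_000
    compressed = gzip.compress(payload)
    total = sum(len(c) for c in iter_decompressed(io.BytesIO(compressed)))
    print(len(compressed), "compressed bytes ->", total, "decompressed bytes")
```

With the streaming style, peak memory depends on the chunk size rather than on the size of the file, which is the property the whole-object approach lacks.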
I would experiment with much smaller files and see what the impact on the verbose GC logs looks like. But then I am a GC nerd.
I don't think there's a reasonable way of doing this with Logstash alone. I suggest you run a script alongside Logstash that downloads the zip files, unpacks them, and hands them over to Logstash.
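A rough sketch of such a helper script in Python, using only the standard library. The function names and the idea of unpacking into a directory that a Logstash file input watches are my assumptions, not something from your setup. The download is copied to disk in chunks, so the 500 MB archive is never held in memory.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path
from typing import List


def download_to_file(url: str, dest: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Stream the response body to disk in fixed-size chunks."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest


def extract_zip(zip_path: Path, out_dir: Path) -> List[Path]:
    """Extract every regular file in the archive and return the extracted paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            extracted.append(Path(archive.extract(info, out_dir)))
    return extracted


def fetch_and_unpack(url: str, watch_dir: Path) -> List[Path]:
    """Download to a temporary file, then unpack into the watched directory."""
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = download_to_file(url, Path(tmp) / "download.zip")
        return extract_zip(archive_path, watch_dir)
```

You would call something like fetch_and_unpack(url, Path("/some/watched/dir")) from cron or a systemd timer, with the URL and directory replaced by your own values, and let Logstash pick up the extracted files from there.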
I see a gzip codec. Should I be using this with .ZIP, or is that not going to work?
It's not going to work. Gzip just compresses a single stream of data; zip is an archive file format that bundles multiple files and also compresses them.
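If you want to see the difference for yourself, here is a small Python sketch (the sniff_format helper is a made-up name for illustration). A gzip stream starts with the bytes 1f 8b, while a zip file carries a directory of named members, so a gzip decoder rejects zip input outright.

```python
import gzip
import io
import zipfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_format(data: bytes) -> str:
    """Return 'gzip', 'zip', or 'unknown' for an in-memory blob."""
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    return "unknown"


if __name__ == "__main__":
    # Build a two-member zip archive in memory.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("a.txt", "first file")
        archive.writestr("b.txt", "second file")
    zip_bytes = buf.getvalue()

    print(sniff_format(zip_bytes))
    print(sniff_format(gzip.compress(b"one stream")))

    # Feeding the zip archive to a gzip decoder fails on the header check.
    try:
        gzip.decompress(zip_bytes)
    except OSError as exc:
        print("gzip refused the zip archive:", exc)
```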