Processing remote ZIP files with Logstash

I'd just like to verify I am attempting this correctly. I am trying to set up a Logstash pipeline that first downloads some large (~500 MB) ZIP files over HTTP, then unzips them, then processes the data. I am having trouble with the first part, though.

  1. Should I use http_poller for this?
  2. I see a gzip codec. Should I be using this with .ZIP, or is that not going to work?
  3. I get these logs when trying with http_poller:

Feb 16 20:50:28 lamp-01 systemd[1]: Started logstash.
Feb 16 20:50:42 lamp-01 logstash[12860]: Sending Logstash's logs to /var/log/logstash which is now configured via log4j2.properties
Feb 16 20:53:00 lamp-01 logstash[12860]: java.lang.OutOfMemoryError: Java heap space
Feb 16 20:53:00 lamp-01 logstash[12860]: Dumping heap to java_pid12860.hprof ...
Feb 16 20:53:00 lamp-01 logstash[12860]: Unable to create java_pid12860.hprof: Permission denied

Any idea why this is happening? The errors appear only a few minutes after the pipeline starts, and I allocated 2 GB of memory to Logstash.

If I read it correctly, a Zlib::GzipReader reads the whole object (500 MB) into memory and decompresses it. I am not surprised it blows up a 2 GB heap. To me it does not appear to be a stream reader.
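To illustrate what a stream reader would look like, here is a minimal Python sketch (not Logstash's Ruby code, and gunzip_stream is a hypothetical name) that decompresses gzip data chunk by chunk instead of loading the whole object first. It assumes a single-member gzip stream.

```python
import gzip
import io
import zlib


def gunzip_stream(source, sink, chunk_size=64 * 1024):
    """Decompress a gzip stream from source into sink, one chunk at a time.

    Only one compressed chunk and its decompressed output are held in
    memory at once, so memory use depends on chunk_size and the
    compression ratio, not on the total size of the input.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        sink.write(data)
        total += len(data)
    # Emit anything still buffered inside the decompressor.
    tail = decompressor.flush()
    sink.write(tail)
    return total + len(tail)
```

A reader that calls read() on the whole response before decompressing needs room for both the compressed and the decompressed copy, which is the pattern that would exhaust a fixed heap.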

I would experiment with much smaller files and see what the impact on the verbose GC logs looks like. But then I am a GC nerd.

I don't have any codec specified currently, though. It's just a simple config with http_poller on a cron schedule and a file output.

I don't think there's a reasonable way of doing this with Logstash alone. I suggest you run a script alongside Logstash that downloads the zip files, unpacks them, and hands them over to Logstash.
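A rough sketch of such a helper script in Python, using only the standard library. The function names, the chunk size, and the idea of handing files over through a watched directory are assumptions for illustration, not something Logstash prescribes.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download(url, dest, chunk_size=1024 * 1024):
    """Stream the response body to disk in fixed-size chunks.

    The archive is never held in memory as a whole, so a 500 MB
    download needs roughly chunk_size bytes of buffer.
    """
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return Path(dest)


def unpack(zip_path, watch_dir):
    """Extract every file in the archive and move it into watch_dir.

    Files are extracted into a staging directory next to watch_dir and
    then moved, so a reader watching watch_dir does not pick up
    half-written files. Archive subdirectories are flattened, so
    members with the same base name would overwrite each other.
    """
    watch_dir = Path(watch_dir)
    watch_dir.mkdir(parents=True, exist_ok=True)
    delivered = []
    with zipfile.ZipFile(zip_path) as archive, \
            tempfile.TemporaryDirectory(dir=str(watch_dir.parent)) as staging:
        for info in archive.infolist():
            if info.is_dir():
                continue
            extracted = Path(archive.extract(info, staging))
            target = watch_dir / extracted.name
            shutil.move(str(extracted), str(target))
            delivered.append(target)
    return delivered
```

Run from cron, this could be called as download("https://example.com/data.zip", "/tmp/data.zip") followed by unpack("/tmp/data.zip", "/var/spool/logstash-in"), with a Logstash file input whose path setting points at that directory. Both the URL and the paths are placeholders.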

I see a gzip codec. Should I be using this with .ZIP, or is that not going to work?

It's not going to work. Gzip just compresses a single stream of data, whereas ZIP is an archive file format that can hold multiple files and also compresses them.
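The difference is easy to see with Python's standard library. This is only an illustration of the two formats (the helper names are made up); it says nothing about how the Logstash gzip codec is implemented.

```python
import gzip
import io
import zipfile

payload = b"hello logstash\n"

# gzip: one compressed stream, no file names, no directory.
gz_bytes = gzip.compress(payload)

# ZIP: an archive with per-file entries, each compressed separately.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("a.log", payload)
    archive.writestr("b.log", payload)
zip_bytes = buffer.getvalue()


def looks_like_gzip(data):
    """gzip streams start with the magic bytes 1f 8b."""
    return data[:2] == b"\x1f\x8b"


def looks_like_zip(data):
    """A non-empty ZIP archive starts with a local file header, 'PK\\x03\\x04'."""
    return data[:4] == b"PK\x03\x04"


def gunzip_or_none(data):
    """Try to gunzip data; return None if it is not a gzip stream."""
    try:
        return gzip.decompress(data)
    except OSError:  # gzip.BadGzipFile is a subclass of OSError
        return None
```

Here gunzip_or_none(gz_bytes) returns the payload, while gunzip_or_none(zip_bytes) returns None because the ZIP header is not a gzip header. Reading the ZIP requires an archive-aware reader such as zipfile.ZipFile.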
