Logstash out of memory with large gz input (caused by json filter)


(Aaron Daisley) #1

Hi.

I've currently got a Logstash instance with which I'm trying to ingest a fairly large set of gzipped data every day (600-800 MB compressed, 6m-11m events). I started with the file input plugin on v6.4.0 (upgrading to 6.5.0 when it was released) and pointed it at the gz files. The instance could get through maybe one or two files before it started garbage collecting constantly and the input rate dropped severely. I tried this with 1 GB, 2 GB, 3 GB and 4 GB of heap and the same thing happened every time. The file input plugin also has trouble picking up new files in the chosen "input bucket", but that's another issue for another post.
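For anyone trying to reproduce this, a minimal sketch of a file input reading gzipped files might look like the following. The paths are placeholders, and the use of read mode is an assumption based on the file input only handling .gz files in that mode; the exact settings used here are not shown in this post.

```
input {
  file {
    # Placeholder path, not the real input directory
    path => "/path/to/input/*.gz"
    # gz files are only processed in read mode
    mode => "read"
    # Read mode deletes completed files by default; log them instead
    file_completed_action => "log"
    file_completed_log_path => "/path/to/completed.log"
  }
}
```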

I then tried splitting the input files every 10k and 100k lines, but it didn't make much of a difference, so it's not the size of the individual files that's causing the constant garbage collection. I've also tried playing with the number of workers and batch sizes, but that doesn't make much of a difference either. The Java execution engine helps a bit, but it only delays the point at which the instance becomes unusable.

I've since moved to the s3 input plugin, as it can take files from the S3 input bucket and decompresses .gz files by default. On Logstash 6.5, the Logstash process starts an insane amount of garbage collecting after a few minutes.

I then tried dropping Logstash down a version to 5.3.2 - this seems to be a lot healthier in terms of memory usage and CPU utilisation, with garbage collection looking healthy too. It doesn’t reach the totally scary point that 6.5 does.

There are several challenges with this, however.
a) Our team would ideally like to stay up-to-date on the Logstash versions going forward, and 5.3.2 is quite out-of-date.
b) The rate is very, very slow in terms of outputting to ES. Using metrics I logged the rate and total number of events, with 6.5's S3 input on the left and 5.3.2's S3 input on the right. Both are using 1 GB of heap. 5.3.2 manages about 4k events per second as opposed to ~15k on 6.5.
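
One way to log the rate and total event count is the metrics filter. The sketch below follows the pattern from the plugin documentation; the meter name "events" and the output format are illustrative assumptions, not the exact configuration behind the numbers above.

```
filter {
  metrics {
    # Creates [events][count], [events][rate_1m], etc. on periodic metric events
    meter => "events"
    add_tag => "metric"
  }
}

output {
  # Only print the periodic metric events, not the ingested data
  if "metric" in [tags] {
    stdout {
      codec => line {
        format => "rate_1m: %{[events][rate_1m]} total: %{[events][count]}"
      }
    }
  }
}
```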

Looking around a little, it seems as though Elastic moved Logstash's worker engine to Ruby around the release of 6.0, and I think this is what's causing the issue. Not knowing too much about Java VMs, Ruby, or the Logstash process, I can't come up with a solution myself.

I was wondering if anyone had a solution for either speeding up the 5.3.2 S3 input or, preferably, stopping the 6.5 input from running out of memory. We need this to stay running continuously and would ideally not want to start up a new Logstash process every single day.

Thanks


(Aaron Daisley) #2

So I kept digging at this and I've now narrowed down the issue. I'm currently running Logstash 6.5 with an S3 input, and I take the message field and deserialise it with a json{} (plugin-filter-json) filter. This seems to be what's causing the out of memory issue.
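
A minimal sketch of the setup described above, for reference. The bucket name and region are placeholders (the real values are not given in this thread), and the output section is omitted.

```
input {
  s3 {
    # Placeholder values
    bucket => "my-input-bucket"
    region => "eu-west-1"
  }
}

filter {
  # Parse the raw JSON string in [message] into event fields
  json {
    source => "message"
  }
}
```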

I don't know if the sheer volume of events is what's causing it or whether there's an issue with deserialising json events in Logstash 6.x.

Would anyone care to weigh in on this?

Edit: I've swapped to the json codec on the input plugin and there are no issues. I'll create an issue on the json filter's GitHub.
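
Roughly, the workaround moves the JSON parsing from the filter stage into the input itself. A sketch, again with a placeholder bucket name and region, and assuming each line in the gz files is a single JSON document:

```
input {
  s3 {
    # Placeholder values
    bucket => "my-input-bucket"
    region => "eu-west-1"
    # Decode each line as JSON on input instead of using the json filter
    codec => "json"
  }
}
```

With this in place, the json { source => "message" } filter is removed from the filter section.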