LS 7.16.3 OutOfMemory jruby.RubyHash WatchedFilesCollection

My Logstash instances have recently started running out of memory (OutOfMemoryError), and I need some help identifying the cause.

My logstash.conf has not changed for a long time, but I did do a few Logstash version upgrades in the past few months (though the upgrades were all from one 7.x version to another 7.x version).

One notable attribute of my environment is that there are a lot of log files and they roll quite often, so Logstash has to keep track of a lot of "watched" files. Historically this has not been an issue, until now when the OOMs started to happen.

What values in logstash.conf should I tweak in relation to org/logstash/filewatch/WatchedFilesCollection? The first option that comes to mind is max_open_files, but I'm not sure if there are any others I should try. My file input currently looks like this:

  file {
    sincedb_path          => "some path"
    max_open_files        => 10000
    close_older           => 0.001
    stat_interval         => 1
    discover_interval     => 5
    sincedb_clean_after   => 1
    ignore_older          => 259200 # 3 days
    path                  => "some path"
    type                  => "some type"
    start_position        => "beginning"
    file_sort_direction   => "desc"

    codec => multiline {
      patterns_dir        => ["${LS_PATTERNS}"]
      pattern             => "\[%{TIMESTAMP_ISO8601}"
      negate              => "true"
      what                => "previous"
      auto_flush_interval => 150
      max_lines           => 5000
      ecs_compatibility   => disabled
    }
  }

I have not verified that it would accumulate in that place, but if the pattern on a multiline codec does not match, it will just keep accumulating data from the file while waiting for a match, and I believe that would happen down inside filewatch. You could try lowering the auto_flush_interval option. If you think it will never take more than a minute to write out a complete error message, then auto_flush_interval => 60 should be OK.
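For illustration, here is a sketch of that change applied to the codec block from the config above (60 is just the example value from the previous sentence; pick whatever upper bound fits how long a complete event takes to be written):

  codec => multiline {
    patterns_dir        => ["${LS_PATTERNS}"]
    pattern             => "\[%{TIMESTAMP_ISO8601}"
    negate              => "true"
    what                => "previous"
    auto_flush_interval => 60    # flush a pending multiline event after 60s instead of 150s
    max_lines           => 5000
    ecs_compatibility   => disabled
  }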

1 Like

Have you tried increasing the heap size (-Xms and -Xmx) in jvm.options?
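For reference, these are the two lines in jvm.options that control the heap (example values only, assuming the host has the memory to spare):

  # jvm.options -- initial and maximum JVM heap size (example values)
  -Xms2g
  -Xmx2g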

1 Like

Yep, that is definitely an option, but unfortunately in my case I cannot increase the Xmx value.

You have already done almost everything you can in the file input settings.
Have you tried setting the pipeline parameters: pipeline.batch.size, pipeline.batch.delay, pipeline.workers?
If delay is not a problem, maybe split into two or more pipelines, or try a persistent queue... just ideas.
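A sketch of where those settings live in logstash.yml (the values shown are just examples/defaults for orientation, not a recommendation):

  # logstash.yml -- pipeline tuning knobs (example/default values)
  pipeline.workers: 2         # example; the default is the number of CPU cores
  pipeline.batch.size: 125    # events each worker collects in memory before filtering/output
  pipeline.batch.delay: 50    # ms to wait for a batch to fill before flushing it
  # queue.type: persisted     # persistent queue: buffer events on disk instead of in memory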

1 Like

The pipeline parameters surprisingly did not work for me. For testing purposes, I set the batch size to 1 and the number of workers to 1, and that did not have any positive impact whatsoever; my Logstash would still go OOM after some time.

I had almost the same results with batch.size and batch.delay; maybe I am doing something wrong, but they had very little impact.
If you use a persistent queue, messages sit on disk rather than in memory. Maybe you can split the heavier data into 2-3 pipelines and test.
If you use codec => rubydebug, remove it; it consumes a lot of resources.
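A sketch of what that kind of split could look like in pipelines.yml (the pipeline ids and config paths are made up for illustration):

  # pipelines.yml -- hypothetical split into two pipelines, one with a persistent queue
  - pipeline.id: heavy-logs
    path.config: "/etc/logstash/conf.d/heavy.conf"
    queue.type: persisted        # buffer this pipeline's events on disk instead of the heap
    queue.max_bytes: 1gb
  - pipeline.id: light-logs
    path.config: "/etc/logstash/conf.d/light.conf"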

1 Like

Splitting into pipelines is a great idea to keep in mind for future use cases.

Unfortunately for me, again, I have little control over access to the environment, so I can't really write to disk.

Also, one difficulty I can see with splitting pipelines is that it depends on whether we can reliably separate heavy and light data sets. In my case, for the same input, the client sometimes "bursts" a lot of heavy data spread across a lot of files, and at other times the same client writes very little data, so it is out of my control.

The only option I have is to play around with the Logstash settings to make sure it does not go over the Xmx limit even when there is a burst of data.

1 Like

From further testing, I found that tweaking max_open_files and the other settings suggested above does not provide any improvement; the OOM still occurs.

I found the root cause to be a file input watching a path something like /path/subdir/*/*/*.log, which matches a very large number of log files, more than a million.

When this input is disabled, the garbage collection stats (jstat -gcutil) show the memory pools looking healthy. When this input is enabled, the pools are maxed out straight away:

jstat -gcutil 23623 1000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
  0.00 100.00 100.00 100.00  86.05  75.61     22    1.453     3    3.549    14    0.713    5.714
  0.00 100.00 100.00 100.00  86.05  75.61     22    1.453     3    3.549    14    0.713    5.714

So it looks like this is related to the file discovery process of the file input plugin. I checked the docs and there doesn't seem to be any option to reduce the list of files it keeps in memory during discovery. If the list of files is really large and Xmx is not big enough, Logstash just blows up.

I think this calls for an "enhancement" to the file input plugin so that it does not keep such a big list, maybe traversing the list in chunks to make sure memory is not blown up by the sheer number of discovered files.
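The only workaround I can think of in the meantime, building on the pipeline-splitting idea above, would be to narrow the glob so that no single file input has to discover the whole tree at once; something like this (the subdirectory names and sincedb paths are placeholders):

  # hypothetical split of the huge glob across separate file inputs (or pipelines),
  # so each one only discovers a subset of the >1M files
  file {
    path           => "/path/subdir/clientA/*/*.log"
    sincedb_path   => "some path A"
    start_position => "beginning"
  }
  file {
    path           => "/path/subdir/clientB/*/*.log"
    sincedb_path   => "some path B"
    start_position => "beginning"
  }

But as noted above, I can't rely on the data being split evenly between those paths.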

Hi @guyboertje, do you have any input on this?

Guy is no longer working on the file input (or Logstash).

Thanks @Badger, I will open a ticket on GitHub then, with some steps to replicate the issue.
