Logstash going OOM - help confirming issue

I've recently begun having problems where Logstash keeps going out of memory a few seconds (or maybe a minute) after startup. I have an idea of what may be happening, but I'm hoping some of the more seasoned pros out here can steer me away from the wrong path, or confirm my suspicions.

I have recently added some new Groks, so I am reasonably confident that this is related to filtering (not input or output).

I downloaded VisualVM to analyze the heap dump, and the 'Classes' breakdown shows that org.joni.StackEntry has 25.8 million instances (95% of all instances) with a size of 931M (71.7% of the heap).

I looked up that class, and it seems related to the Regex engine, and my thinking is that I've done something to cause Catastrophic Backtracking (an insanely cool word for something very not cool to happen in Production).
So basically, I think I've got some regexes with too many lookaheads (maybe?)
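To illustrate what I suspect (this is my own sketch, not something pulled from the heap dump), the fallback match I use everywhere looks like this:

grok {
  match => { "message" => "(?<NOMATCH>(.|\r|\n)*)" }
}

My understanding is that the alternation inside the repeat makes joni push a backtrack entry (org.joni.StackEntry) for every character it consumes, so a very long line costs a huge number of stack entries per match attempt. If that's right, something like (?m).* (letting the dot match newlines) should capture the same thing without the per-character alternation:

grok {
  match => { "message" => "(?<NOMATCH>(?m).*)" }
}

That's just my guess at a lighter-weight equivalent, though.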

I will put my filter block in a comment (too many characters)

Indexer file: Part 1 (Whole thing is too large for 1 comment):

input {
  redis {
    host => ["REDACTED"]
    port => 6379
    key => "logstash"
    data_type => "list"
  }
}

filter {
  ruby {
    code => "event['logfilename'] = File.basename(event['path'],File.extname(event['path']))"
  }

  if ( "apache2_access.log" in [tags] ) {
    grok {
      match => { "message" => "%{IPORHOST:apache_host} %{COMMONAPACHELOG} - %{QS:referrer} %{QS:unknown} %{QS:agent}" }
      match => { "message" => "%{COMMONAPACHELOG}" }
      match => { "message" => "(?<NOMATCH>(.|\r|\n)*)" }
    }
    mutate {
      gsub => [
        "referrer", "\"", ""
      ]
    }
    mutate {
      gsub => [
        "unknown", "\"", ""
      ]
    }
    mutate {
      gsub => [
        "agent", "\"", ""
      ]
    }
  }

  if ( "apache2_rewrite.log" in [tags] ) {
    grok {
      match => { "message" => "%{IPORHOST:remote_host} %{NOTSPACE:remote_logfile} %{NOTSPACE:remote_user} \[%{HTTPDATE}\] \[%{IPORHOST:apache_host}\/sid#(?<sid>(?:((?>(?!\]))).)*)\]\[rid#(?<rid>(?:((?>(?!\/))).)*)\/(?<rid_type>(?:((?>(?!\]))).)*)\] \(%{NUMBER:count}\) (?<msg>(.|\r|\n)*)" }
      match => { "message" => "(?<NOMATCH>(.|\r|\n)*)" }
    }
  }

  if ( ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) or ("REDACTED.log" in [tags]) ){ 
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:LogTimestamp}%{SPACE}\[(?<ExecRequest>[^\]]*)]%{SPACE}%{LOGLEVEL:LogLevel}%{SPACE}(?<JavaClass>[0-9A-Za-z\._]+)-\[(?<Block1>(?:((?>(?!\]-))).)*)\]-\[(?<Block2>(?:((?>(?!\]-))).)*)\]-(?<MainMessage>(.|\r|\n)*)" }
      #This is a fallout match. If a log message doesn't match the above pattern, it will fall into the nomatch bucket
      # This will be helpful to identify any pattern matching work that needs to be done in the future
      match => { "message" => "(?<NOMATCH>(.|\r|\n)*)" }
    }
    date {
      match => [ "LogTimestamp", "ISO8601" ]
      target => "@logtimestamp"
    }
    if [MainMessage] =~ "\A{(.|\r|\n)*" {
      ruby {
        code => "event['json'] = event['MainMessage']"
      }
    }
    if ( "" in [json] ) {
      json {
        source => "json"
      }
    }

    #Removing fields no longer needed
    mutate { remove_field => ["json"] }
    
    #Removing REDACTED field
    mutate { remove_field => [ "REDACTED" ] }
    mutate {
      gsub => [
        "message", "\"REDACTED\":\"(?:((?>(?!\",))).)*\"", "\"REDACTED\":\"****\"",
        "MainMessage",  "\"REDACTED\":\"(?:((?>(?!\",))).)*\"", "\"REDACTED\":\"****\""
      ]
    }
    
    if ( [logfilename] in ["REDACTED","REDACTED"] ) {
      mutate {
        rename => { "Block1" => "SesnId"}
        rename => { "Block2" => "ExpId"}
      }
    }
    if ( [logfilename] in ["REDACTED"] ) {
      mutate {
        rename => { "Block1" => "TranId" }
        rename => { "Block2" => "SesnId" }
      }
    }

    #GEO IP
    if [RemoteAddress] =~ /(?:((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?|(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]))/ {
      geoip {
        source => "RemoteAddress"
        database => "/etc/logstash/GeoLiteCity.dat"
      }
    }
  }

Indexer file part 2:

  #REDACTED GROUP#
  if ( "REDACTED" in [tags] ) {
    #REDACTED#
    if ( "REDACTED.log" in [tags] ) {
      grok {
        match => { "message" => "\[%{TIMESTAMP_ISO8601:@logtimestamp}\]%{SPACE}%{LOGLEVEL:loglevel}%{SPACE}\[(?<Block1>(?:((?>(?!\]-))).)*)\]\[(?<codefile>(?:((?>(?!:))).)*):(?<linenum>(?:((?>(?!\]-))).)*)\]%{SPACE}(?<logmessage>(.|\r|\n)*)" }
        match => { "message" => "(?<NOMATCH>(.|\r|\n)*)" }
      }
      date {
        match => ["@logtimestamp", "ISO8601" ]
        target => "@logtimestamp"
      }
      if [logmessage] =~ "\A{(.|\r|\n)*" {
        mutate {
          gsub => [
            "logmessage", "=", ":"
          ]
        }
        mutate {
          gsub => [
            'logmessage', '([{ ])([a-z]+):', '\1"\2":'
          ]
        }
        mutate {
          gsub => [
            'logmessage', '(:)(?<!")((?:\b|/)\S+\b)(?!")', '\1"\2"'
          ]
        }
        json { source => "logmessage"  }
      }
    }
    #/REDACTED#
    
    
    #REDACTED#
    if ( "REDACTED.txt" in [tags] ) {
      grok {
        match => { "message" => "%{IPORHOST:clientip}%{SPACE}%{HTTPDUSER:ident}%{SPACE}%{USER:auth}%{SPACE}\[%{HTTPDATE:@logtimestamp}\]%{SPACE}\"(?:%{WORD:verb}%{SPACE}%{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\"%{SPACE}%{NUMBER:response}%{SPACE}(?:%{NUMBER:bytes}|-)" }
        match => { "message" => "%{IPORHOST:clientip}%{SPACE}%{QUOTEDSTRING:x-forwarded-for}%{SPACE}%{QUOTEDSTRING:set-cookie}%{SPACE}%{QUOTEDSTRING:jsessionid}%{SPACE}%{HTTPDUSER:ident}%{SPACE}%{USER:auth}%{SPACE}\[%{HTTPDATE:@logtimestamp}\]%{SPACE}\"(?:%{WORD:verb}%{SPACE}%{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})\"%{SPACE}%{NUMBER:response}%{SPACE}(?:%{NUMBER:bytes}|-)" }
        match => { "message" => "(?<NOMATCH>(.|\r|\n)*)" }
      }
      date {
        match => ["@logtimestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
        target => "@logtimestamp"
      }
      if [clientip] =~ /(?:((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?|(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]))/ {
        geoip {
          source => "clientip"
          database => "/etc/logstash/GeoLiteCity.dat"
        }
      }
      if [bytes] == "-" {
        mutate {
          gsub => [
            "bytes", "-", "0"
          ]
        }
      }
    }
    #/REDACTED#
    
    #REDACTED#
    if ( "REDACTED.log" in [tags] ) {
      grok {
        match => { "message" => "\[%{TIMESTAMP_ISO8601:@logtimestamp}\]%{SPACE}%{WORD:loglevel}%{SPACE}\[%{NOTSPACE:logger}%{SPACE}(?<method>(?:((?>(?!\]))).)*)\]%{SPACE}(?<logmessage>(.|\r|\n)*)" }
        match => { "message" => "(?<NOMATCH>(.|\r|\n)*)" }
      }    
      date {
        match => ["@logtimestamp", "ISO8601" ]
        target => "@logtimestamp"
      }
    }
    #/REDACTED#
  }
  #/REDACTED GROUP#
}

output {
  elasticsearch {
    hosts => ["REDACTED"]
    index => "logstash-%{+YYYY.MM.dd}"
  }
}

gsub is flat-out regexp (Mutate filter plugin | Logstash Reference [8.11] | Elastic), so it's going to be expensive.
The ruby code looks OK, but it is also relatively expensive.

What is that all about?

Why not just use mutate (Mutate filter plugin | Logstash Reference [8.11] | Elastic)?

Also, just do:

grok {
  match => { "message" => [ "PATTERN1", "PATTERN2" ] }
}

I wrote the
if [ip] =~ <ip regex>
code a while back. It's an attempt to run the geoip filter only when the field actually contains a valid IP. I just took the regexes from the %{IP} pattern as they were.
I was having cases where the fields in question would contain either a valid IP or something like 'unknown', so it was just a proactive attempt to skip geo filtering when it wasn't needed.
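(Writing that out, it occurs to me that for my data a much simpler guard might do; rough sketch below, where 'unknown' is just the junk value I actually see in that field, not a general rule:)

if [RemoteAddress] and [RemoteAddress] != "unknown" {
  geoip {
    source => "RemoteAddress"
    database => "/etc/logstash/GeoLiteCity.dat"
  }
}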

Regarding the mutate:
I could use mutate replace. I am curious what the differences are. Does gsub by itself bring that much overhead? What does mutate replace do that makes it cheaper? (Questions for my own learning.)

For PATTERN1, PATTERN2:
I can make that change as well. A similar question, though: is there much performance overhead in having the patterns in multiple match blocks?

--

The OutOfMemory problem is fairly recent; I believe I started noticing it after adding the last three 'sections'. I was thinking maybe it had something to do with all of the %{QUOTEDSTRING}s, %{NOTSPACE}s, and negative lookaheads I added. Is your thinking that this is more related to the overuse of gsub and the overhead it brings with it (regardless of what matching is happening)?

And, thank you very much for looking over my stuff and providing feedback. I'll be implementing some, if not all, of those tweaks in the next day or two.

Understood, it's just expensive.

It's a straight (aka "dumb") replace, rather than a regexp check and then replace.
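For example, the bytes cleanup from your config could be a straight replace instead of a gsub (rough sketch based on what you posted; the behavior should be the same):

if [bytes] == "-" {
  mutate {
    replace => { "bytes" => "0" }
  }
}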

Nah, just cleaner :slight_smile:
You could also build your own patterns file and move things there too.
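Something like this, roughly (the file location and pattern name here are just placeholders):

# /etc/logstash/patterns/extra
NOMATCH (.|\r|\n)*

# and then in the filter:
grok {
  patterns_dir => ["/etc/logstash/patterns"]
  match => { "message" => [ "PATTERN1", "%{NOMATCH:NOMATCH}" ] }
}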

Generally, OOM occurs when there is too much in the heap, obviously. The why could be many things, such as a super-big message coming in, or a regexp taking up that last bit of memory.

I am not a super awesome logstash person like others, but just pointing out what I can in the hopes that general improvements may help.

How much heap does LS have?

How much heap does LS have?

1G, it seems. I tried setting the env variable 'LS_HEAP_SIZE=2048m' but it doesn't seem to recognize it.
I included that in my init script, so it's definitely part of the environment when the LS process starts. I'm on version 2.1 currently, and I'm in the process of rolling out 2.3 through my environment chain.

Could you run ps -aef | grep logstash and post the result here? It should show all the switches supplied to Logstash. With a proper setting you should see something like -Xmx2048m -Xss2048k, and these switches should be present only once.

Sure

(15:52:14) PROD ->ps -aef | grep logstash
root     54522     1 99 15:52 pts/0    00:00:17 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/logstash/current/heapdump.hprof -Xmx1g -Xss2048k -Djffi.boot.library.path=/opt/logstash/logstash-2.1.0/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/logstash/current/heapdump.hprof -Xbootclasspath/a:/opt/logstash/logstash-2.1.0/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/logstash-2.1.0/vendor/jruby -Djruby.lib=/opt/logstash/logstash-2.1.0/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/current/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/indexer1.conf &


If it's helpful, here's the latest log output after the process went OOM again.

java.lang.OutOfMemoryError: Java heap space
Dumping heap to /opt/logstash/current/heapdump.hprof ...
Heap dump file created [1540367433 bytes in 27.384 secs]
{:timestamp=>"2016-08-02T15:58:32.002000+0000", :message=>"Exception in filterworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash.", "exception"=>java.lang.OutOfMemoryError: Java heap space, "backtrace"=>["org.joni.StackMachine.ensure1(StackMachine.java:98)", "org.joni.StackMachine.push(StackMachine.java:162)", "org.joni.StackMachine.pushAlt(StackMachine.java:200)", "org.joni.ByteCodeMachine.opPush(ByteCodeMachine.java:1517)", "org.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:272)", "org.joni.Matcher.matchCheck(Matcher.java:304)", "org.joni.Matcher.searchInterruptible(Matcher.java:457)", "org.jruby.RubyRegexp$SearchMatchTask.run(RubyRegexp.java:273)", "org.jruby.RubyThread.executeBlockingTask(RubyThread.java:1066)", "org.jruby.RubyRegexp.matcherSearch(RubyRegexp.java:235)", "org.jruby.RubyRegexp.search19(RubyRegexp.java:1780)", "org.jruby.RubyRegexp.matchPos(RubyRegexp.java:1720)", "org.jruby.RubyRegexp.match19Common(RubyRegexp.java:1701)", "org.jruby.RubyRegexp.match_m19(RubyRegexp.java:1680)", "org.jruby.RubyRegexp$INVOKER$i$match_m19.call(RubyRegexp$INVOKER$i$match_m19.gen)", "org.jruby.internal.runtime.methods.JavaMethod$JavaMethodOneOrNBlock.call(JavaMethod.java:350)", "org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)", "rubyjit.Grok$$match_and_capture_5fff680283e4b4ebc4c0eb1732d88dbb187668931442407170.file(/opt/logstash/current/vendor/bundle/jruby/1.9/gems/jls-grok-0.11.2/lib/grok-pure.rb:177)", "rubyjit.Grok$$match_and_capture_5fff680283e4b4ebc4c0eb1732d88dbb187668931442407170.file(/opt/logstash/current/vendor/bundle/jruby/1.9/gems/jls-grok-0.11.2/lib/grok-pure.rb)", "org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:201)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:177)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:188)", "rubyjit.LogStash::Filters::Grok$$match_against_groks_f366a5df68c47d69f27e877618fa5dabc744d0fd1442407170.block_0$RUBY$file(/opt/logstash/current/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-2.0.2/lib/logstash/filters/grok.rb:327)", "rubyjit$LogStash::Filters::Grok$$match_against_groks_f366a5df68c47d69f27e877618fa5dabc744d0fd1442407170$block_0$RUBY$file.call(rubyjit$LogStash::Filters::Grok$$match_against_groks_f366a5df68c47d69f27e877618fa5dabc744d0fd1442407170$block_0$RUBY$file)", "org.jruby.runtime.CompiledBlock19.yield(CompiledBlock19.java:135)", "org.jruby.runtime.Block.yield(Block.java:142)", "org.jruby.RubyArray.eachCommon(RubyArray.java:1606)", "org.jruby.RubyArray.each(RubyArray.java:1613)", "org.jruby.RubyArray$INVOKER$i$0$0$each.call(RubyArray$INVOKER$i$0$0$each.gen)", "org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:143)", "org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:154)", "rubyjit.LogStash::Filters::Grok$$match_against_groks_f366a5df68c47d69f27e877618fa5dabc744d0fd1442407170.file(/opt/logstash/current/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-2.0.2/lib/logstash/filters/grok.rb:325)"], :level=>:error}
Error: Your application used more memory than the safety cap of 1G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

Yeah, that didn't work. Can you show where exactly you tried to set this env variable?

I have an init script that I use to start logstash as a service.

In the init script I have this line:

LS_HEAP_SIZE=2048m

Head of the init.d file:

#!/bin/bash
#
# Logstash Indexer init file for starting up the Logstash Indexer daemon
#
# chkconfig:   - 20 80
# description: Starts and stops the  Logstash Indexer daemon.
PATH=/sbin:/usr/sbin:/bin:/usr/bin
DESC="Logstash Indexer "
NAME=logstash
DAEMON=/opt/logstash/current/bin/$NAME
DAEMON_ARGS="-f /etc/logstash/indexer1.conf &"
PIDFILE=/var/run/$NAME.pid
SCRIPTNAME=/etc/init.d/$NAME
LOG=/opt/logstash/current/bin/logstash_init.log

LS_HEAP_SIZE=2048m

Yes, you set it in the init.d environment but depending on how you start logstash after that it might or might not make it to the logstash process itself.

I start logstash with a service command. It should take the vars into account. I use that for setting the sincedb_path in other cases.

I am not good enough with shell scripting to debug it just by looking at pieces of it, but I am pretty sure the issue is the variable somehow not getting through.

So some new info to bring to the table.

  • I was able to figure out why my env variable wasn't being set. I went and compared to other cases where I was setting vars in the init script (and they were working), and I found the difference. I wasn't 'export'ing the variable for memory correctly.

    I changed the line to:

    export LS_HEAP_SIZE=2048m

    And it correctly reflected in the new process when I started it up again.
    However, the process soon began filling up on memory again. The process did not crash because of the memory use, and is still running from yesterday, but 'top' shows a RES of 2.5G. I wouldn't normally care, since the process isn't crashing, but I checked another Logstash process on a different server (running a higher volume of data, and active for MUCH longer), and that process only has a RES of 435m.

  • It seems that the higher memory usage loosely correlates (by my own observations, not rigorous testing) with Redis being backlogged. In most of the recent cases, I find that a system isn't indexing data, and upon investigation I see that the Logstash indexer process has crashed. I originally thought this crash was the root cause of Redis being backlogged (since there's nothing left to take items off of Redis), but now I'm wondering if I'm getting large volume spikes which flood Redis temporarily and cause Logstash to crash, after which Redis keeps filling up, compounding the problem. All of that is speculation, though, and I'm not completely sure how to test it.

A better update.
I think I found a better correlation.
It seems that occasionally I get large debug logs flowing into my system. When Logstash tries to process them, something about the large logs causes a spike in memory.

I'm running some tests now. Along with finding the huge log messages (it's all relative; they're only about 275k bytes per log line), I'm also finding that some of these huge messages have badly formed JSON and are breaking the JSON filter. I'm going to test whether the plugin failing to parse the malformed JSON coincides with Logstash not giving back memory.
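One thing I'm considering trying as part of these tests (just a sketch, and the 10 KB cutoff is arbitrary) is capping the message size before the groks and the json filter ever see it, using the same Logstash 2.x event API as my logfilename ruby filter:

ruby {
  # truncate very long lines up front so huge debug messages don't hit the regexes at full size
  code => "event['message'] = event['message'][0,10240] if event['message'] && event['message'].length > 10240"
}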

So far, none of the testing I've done shows any linkage between the malformed JSON and the high memory usage.
I have discovered, however, that Logstash grows its heap roughly in proportion to the size of the largest individual document it receives, and also to the number of those documents in the file. (My testing was done on a file containing these log messages, not directly against a Redis list.)

The part that confuses me, though: it seems like Logstash never gives BACK memory.

Some of my tests:
Notes:
Both long and short log messages are parsed with the same Grok matcher.
Short log size: around 100-150 bytes per line.
Long log size: around 275,000 bytes (275k) per line.
Logstash heap size was 3G.
I tested using a correctly formed JSON block in the long log messages unless otherwise specified.

  • 100 lines of a short log
    Logstash heap grew to about 250M and stayed there.

  • 10,000 lines of the short log
    Logstash heap grew to about 275M and stayed there.

  • 500 lines of the short log, followed by 1 long log, followed by 500 lines of the short log
    Logstash heap grew to about 275M and stayed there

  • 500 lines of the short log, followed by 50 lines of the long log, followed by 500 lines of the short log
    Logstash heap grew to about 1.7G and stayed there.
    This test was an attempt to see if the Logstash heap would begin to contract once the larger log sizes were flushed through, and smaller logs were being parsed again. Did not happen.

  • 500 lines of the short log, followed by 500 lines of the long log, followed by 500 lines of the short log
    Logstash heap grew to 2.2G and stayed there.
    This test was an attempt to overflow the heap with memory.

  • 2000 lines of the long log
    Logstash heap grew to 2.2G and stayed there.
    I'm not sure why 2.2G is the magic number above which the heap will not go in my testing. But, for the testing, I trimmed out all other filtering except for ONLY the Grok match that was specific to these log formats. In reality my indexer file has about 5 different types of file formats it handles, along with replacements, json filters, geoip filters, and more. So loading those may be what grows the heap higher in the regular systems.

I will probably start doing more testing to see specifically how each piece of the indexer filtering affects the memory usage.

However, is it normal for Logstash to not give back memory once it is no longer needed?

It depends on how you measure this memory. If you look at it as memory used by the Java process, then yes, Java will never give this memory back to the OS. But if you look at the used heap, using VisualVM for example, you should see it going up and down.

Based on the testing I'm doing, it really just seems like it's a combination of:

  • Size of log messages (large messages)
  • Amount of large log messages
  • Amount of filtering logic in place

that leads to the 3+ gig heap usage.
When I stripped out most of the filtering, memory seemed stable; I just needed a lot more of it than I was giving.

I think this answers my question, but I'll probably continue doing testing on each piece of the filtering logic to get a better idea of how each one affects memory usage.

Thanks for your help, guys. I just need to give more memory to Logstash :slight_smile: