Logstash consuming a huge amount of CPU - memory leak? garbage collector?

Details:

  • dockerized environment, isolated, only ELK
  • logstash version 2.3.1 (Apr 4, 2016)
  • logstash-input-kafka-2.0.6
  • elasticsearch output
  • 200 records/second
  • 8 cores, 8GB of RAM, 2GB logstash heap (CPU consumption ("ni") goes to 100% on all eight cores)
  • no other process except elasticsearch consumes more than 1% of CPU/memory
  • after some time (usually several hours) logstash stops sending data to ES
  • nothing in the logstash logs
  • when the logstash service is restarted, everything works again for some time

logstash process:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
systemd+ 15395  457  5.3 6953124 424628 ?      SNl  08:57  63:19 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -XX:HeapDumpPath=/opt/logstash/heapdump.hprof -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log --auto-reload

Logstash threads (top -Hp):

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4637 systemd+  39  19 7045796 2,201g   5728 R 97,7 28,8 292:53.17 [main]>worker7
 4632 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 140:04.15 [main]>worker2
 4635 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 223:43.32 [main]>worker5
 4630 systemd+  39  19 7045796 2,201g   5728 R 97,0 28,8 281:26.59 [main]>worker0
 4631 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 160:18.87 [main]>worker1
 4636 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 134:52.64 [main]>worker6
 4517 systemd+  39  19 7045796 2,201g   5728 R 79,7 28,8 183:59.43 LogStash::Runne
 4493 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:23.33 java
 4495 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:18.66 java
 4497 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:22.12 java
 4498 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:21.07 java                                                                                 
 ...

Logstash configuration

several Kafka input blocks like this:

kafka {
    zk_connect => "pbdcnt1:2181,pbdcnt2:2181,pbdcnt3:2181"
    topic_id => "transactions"
    tags => ["transactions"]
    group_id => "prod_elastic_fraud"
    codec => plain {
        charset => "Windows-1250"
    }
}

several filter blocks like this:

filter {
    if "transactions" in [tags] {
        grok {
            match => ["message", "(?<log_time>.*?)? ?(?<src>.*?)? ?(?<insert_time>.*?);(?<col1>.*?);(?<col2>.*?);...;(?<colN>.*?)"]
        }
    }
}

and the output is this:

output {
  if [tags] {
    elasticsearch {
      hosts => ["localhost"]
      index => "%{tags}-%{+YYYY.MM.dd}"
    }
  }
}

A leak somewhere? Garbage collection time is constantly increasing (jstat):

S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC  
  0.00  52.87  19.08  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
  0.00  52.87  84.52  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
 72.04   0.00  60.68  13.50  59.96  57920 2977.954  4916  206.461 3184.415 Allocation Failure   No GC               
  0.00  65.02  64.85  13.51  59.96  57921 2978.009  4916  206.461 3184.471 Allocation Failure   No GC               
 58.54   0.00  40.52  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
 58.54   0.00  98.01  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
  0.00  58.06  99.74  13.51  59.96  57924 2978.107  4916  206.461 3184.569 Allocation Failure 

jstack for the process:

Are you seeing any errors in the Logstash.log file?

Try using kafka-consumer-offset-checker.sh to see if the Kafka topics are being consumed while it is spinning.

You could also add an else branch that outputs to stdout, to see if anything isn't getting the tags field.
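
Something like this, roughly based on the output block you posted (the rubydebug codec is just for inspecting events on the console):

output {
  if [tags] {
    elasticsearch {
      hosts => ["localhost"]
      index => "%{tags}-%{+YYYY.MM.dd}"
    }
  } else {
    # events that never got a tags field show up here instead of silently disappearing
    stdout {
      codec => rubydebug
    }
  }
}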

No, there are only errors related to not being able to parse some lines, like:
... must not contain the following characters [\, /, *, ?, ", <, >, |, , ,] and similar, nothing "serious".

Yes, the topics are being consumed and the offset is set correctly. If I stop ELK and restart, it continues where it left off.
What do you mean by something not getting the tags field? When would that happen, and what would be the consequence?

Well, if the "if [tags]" condition is false then the message is effectively dropped. Since your offsets are moving but nothing is landing in Elasticsearch, I'd suspect something is up with the logic. That would also explain the spike in CPU, as Kafka is probably slamming those messages through very quickly since Logstash is basically outputting to /dev/null. Otherwise, perhaps it only manifests over time.

Are you able to provide your config in full? Only having pieces can sometimes mask an issue. Apologies if you already provided it all.

Sorry if I wasn't completely clear.
The data is propagated to Elasticsearch correctly. The Logstash process consumes more and more CPU until a certain point, and THAT is when Logstash stops propagating data to Elasticsearch.
That is why I included all those JVM stats.

I will provide full config later, not at home at the moment.

Thanks,
Eddie

Sadly, it seems the regex is what was consuming all that CPU. When I removed the grok regex and used the csv filter with the columns option (which is basically what I wanted to do in the first place), CPU now stays constantly around 150% (which still seems too high for 200 transactions/s). But hey, it is what it is for now.
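
For anyone who finds this later, the replacement looks roughly like this (the column names here are placeholders, the real config lists the actual fields of the feed):

filter {
    if "transactions" in [tags] {
        # csv just splits on the delimiter instead of backtracking through the whole line
        csv {
            separator => ";"
            columns => ["log_time", "src", "insert_time", "col1", "col2", "colN"]
        }
    }
}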

What remains a mystery is why it works for a while and then simply stops. Maybe I should add more workers...