Logstash consuming a huge amount of CPU - memory leak? garbage collector?

Details:

  • dockerized environment, isolated, only ELK
  • logstash version 2.3.1 (Apr 4, 2016)
  • logstash-input-kafka-2.0.6
  • elasticsearch output
  • 200 records/second
  • 8 cores, 8GB of RAM, 2GB logstash heap (CPU consumption ("ni") goes to 100% on all eight cores)
  • no other process except elasticsearch consumes more than 1% of CPU/memory
  • after some time (usually several hours) logstash stops sending data to ES
  • nothing in the logstash logs
  • when the logstash service is restarted, everything works again for some time

logstash process:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
systemd+ 15395  457  5.3 6953124 424628 ?      SNl  08:57  63:19 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -XX:HeapDumpPath=/opt/logstash/heapdump.hprof -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log --auto-reload

Logstash threads (top -Hp):

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 4637 systemd+  39  19 7045796 2,201g   5728 R 97,7 28,8 292:53.17 [main]>worker7
 4632 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 140:04.15 [main]>worker2
 4635 systemd+  39  19 7045796 2,201g   5728 R 97,3 28,8 223:43.32 [main]>worker5
 4630 systemd+  39  19 7045796 2,201g   5728 R 97,0 28,8 281:26.59 [main]>worker0
 4631 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 160:18.87 [main]>worker1
 4636 systemd+  39  19 7045796 2,201g   5728 R 96,3 28,8 134:52.64 [main]>worker6
 4517 systemd+  39  19 7045796 2,201g   5728 R 79,7 28,8 183:59.43 LogStash::Runne
 4493 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:23.33 java
 4495 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:18.66 java
 4497 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:22.12 java
 4498 systemd+  39  19 7045796 2,201g   5728 S  1,7 28,8  33:21.07 java                                                                                 
 ...

Logstash configuration

several Kafka input blocks like this:

kafka {
    zk_connect => "pbdcnt1:2181,pbdcnt2:2181,pbdcnt3:2181"
    topic_id => "transactions"
    tags => ["transactions"]
    group_id => "prod_elastic_fraud"
    codec => plain {
        charset => "Windows-1250"
    }
}

several filter blocks like this:

filter {
    if "transactions" in [tags] {
        grok {
            match => ["message", "(?<log_time>.*?)? ?(?<src>.*?)? ?(?<insert_time>.*?);(?<col1>.*?);(?<col2>.*?);...;(?<colN>.*?)"]
        }
    }
}

and the output is this:

output {
  if [tags] {
    elasticsearch {
      hosts => ["localhost"]
      index => "%{tags}-%{+YYYY.MM.dd}"
    }
  }
}

A leak somewhere? Garbage collection time is constantly increasing (jstat):

S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT    LGCC                 GCC  
  0.00  52.87  19.08  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
  0.00  52.87  84.52  13.50  59.96  57919 2977.900  4916  206.461 3184.361 Allocation Failure   No GC               
 72.04   0.00  60.68  13.50  59.96  57920 2977.954  4916  206.461 3184.415 Allocation Failure   No GC               
  0.00  65.02  64.85  13.51  59.96  57921 2978.009  4916  206.461 3184.471 Allocation Failure   No GC               
 58.54   0.00  40.52  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
 58.54   0.00  98.01  13.51  59.96  57922 2978.058  4916  206.461 3184.520 Allocation Failure   No GC               
  0.00  58.06  99.74  13.51  59.96  57924 2978.107  4916  206.461 3184.569 Allocation Failure 

jstack for the process:

Are you seeing any errors in the Logstash.log file?

Try using kafka-consumer-offset-checker.sh to see if the Kafka topics are being consumed while it is spinning.

You could also add an else branch that outputs to stdout, to see if anything isn't getting the tags field.
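
Something like this, roughly based on the output block you posted (the rubydebug codec is just for inspecting events on the console):

output {
  if [tags] {
    elasticsearch {
      hosts => ["localhost"]
      index => "%{tags}-%{+YYYY.MM.dd}"
    }
  } else {
    # events that never got a tags field show up here instead of silently disappearing
    stdout {
      codec => rubydebug
    }
  }
}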

No, there are only errors related to not being able to parse some lines, like:
... must not contain the following characters [\, /, *, ?, ", <, >, |, , ,] and similar, nothing "serious".

Yes, the topics are being consumed and the offset is set correctly. If I stop ELK and restart, it continues where it left off.
What do you mean by something not getting the tags field? When would that happen, and what would be the consequence?

Well, if the "if [tags]" condition is false then the message is effectively dropped. Since your offsets are moving but nothing is landing in Elasticsearch, I'd suspect something is up with the logic. That would also explain the spike in CPU, as Kafka is probably slamming those messages through very quickly since Logstash is basically outputting to /dev/null. Otherwise, perhaps it only manifests over time.

Are you able to provide your config in full? Only having pieces can sometimes mask an issue. Apologies if you already provided it all.

Sorry if I wasn't completely clear.
The data is propagated to Elasticsearch correctly. The Logstash process consumes more and more CPU until a certain point, and THAT is when Logstash stops propagating data to Elasticsearch.
That is why I included all those JVM stats.

I will provide full config later, not at home at the moment.

Thanks,
Eddie

Sadly, it seems the regex is what was consuming all that CPU. When I removed the grok regex and used the csv filter with the columns option (which is basically what I wanted to do in the first place), CPU now stays constantly around 150% (which still seems too high for 200 transactions/s). But hey, it is what it is for now.
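
For anyone who finds this later, the replacement looks roughly like this (the column names here are placeholders, the real config lists the actual fields of the feed):

filter {
    if "transactions" in [tags] {
        # csv just splits on the delimiter instead of backtracking through the whole line
        csv {
            separator => ";"
            columns => ["log_time", "src", "insert_time", "col1", "col2", "colN"]
        }
    }
}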

What remains a mystery is why it works for a while and then simply stops. Maybe I should add more workers...