Details:
- Dockerized environment, isolated, only the ELK stack
- Logstash version 2.3.1 (Apr 4, 2016)
- logstash-input-kafka 2.0.6
- elasticsearch output
- 200 records/second
- 8 cores, 8 GB of RAM, 2 GB Logstash heap (set as sketched below); CPU consumption (the "ni" category in top) goes to 100% on all eight cores
- no other process except Elasticsearch consumes more than 1% of CPU/memory
- after some time (usually several hours) Logstash stops sending data to ES
- nothing in the Logstash logs
- when the Logstash service is restarted, everything works again for some time
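For completeness, the 2 GB heap is set through the standard Logstash 2.x mechanism; a minimal sketch, assuming the defaults file shipped with the package (the exact path may differ inside the container image):

# /etc/default/logstash (or /etc/sysconfig/logstash on RPM-based images)
LS_HEAP_SIZE="2g"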
Logstash process (ps output):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
systemd+ 15395 457 5.3 6953124 424628 ? SNl 08:57 63:19 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -Xmx2g -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Djava.io.tmpdir=/opt/logstash -XX:HeapDumpPath=/opt/logstash/heapdump.hprof -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log --auto-reload
Logstash threads (top -Hp):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4637 systemd+ 39 19 7045796 2,201g 5728 R 97,7 28,8 292:53.17 [main]>worker7
4632 systemd+ 39 19 7045796 2,201g 5728 R 97,3 28,8 140:04.15 [main]>worker2
4635 systemd+ 39 19 7045796 2,201g 5728 R 97,3 28,8 223:43.32 [main]>worker5
4630 systemd+ 39 19 7045796 2,201g 5728 R 97,0 28,8 281:26.59 [main]>worker0
4631 systemd+ 39 19 7045796 2,201g 5728 R 96,3 28,8 160:18.87 [main]>worker1
4636 systemd+ 39 19 7045796 2,201g 5728 R 96,3 28,8 134:52.64 [main]>worker6
4517 systemd+ 39 19 7045796 2,201g 5728 R 79,7 28,8 183:59.43 LogStash::Runne
4493 systemd+ 39 19 7045796 2,201g 5728 S 1,7 28,8 33:23.33 java
4495 systemd+ 39 19 7045796 2,201g 5728 S 1,7 28,8 33:18.66 java
4497 systemd+ 39 19 7045796 2,201g 5728 S 1,7 28,8 33:22.12 java
4498 systemd+ 39 19 7045796 2,201g 5728 S 1,7 28,8 33:21.07 java
...
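To tie the busy worker threads above to stack traces, the decimal thread ids from top -Hp can be converted to hex and looked up as nid= values in the jstack dump further down; a sketch, with <logstash_pid> as a placeholder for the current Java process id:

# worker7's thread id 4637 in hex, as jstack reports it
printf 'nid=0x%x\n' 4637                  # -> nid=0x121d
jstack <logstash_pid> | grep -A 20 'nid=0x121d'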
Logstash configuration
Several Kafka input blocks like this:
kafka {
  zk_connect => "pbdcnt1:2181,pbdcnt2:2181,pbdcnt3:2181"
  topic_id => "transactions"
  tags => ["transactions"]
  group_id => "prod_elastic_fraud"
  codec => plain {
    charset => "Windows-1250"
  }
}
Several filter blocks like this:
filter {
  if "transactions" in [tags] {
    grok {
      match => ["message", "(?<log_time>.*?)? ?(?<src>.*?)? ?(?<insert_time>.*?);(?<col1>.*?);(?<col2>.*?);...;(?<colN>.*?)"]
    }
  }
}
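The grok pattern can be checked in isolation with a throwaway pipeline; a sketch with the column list cut down to three hypothetical fields:

bin/logstash -e '
  input  { stdin {} }
  filter { grok { match => ["message", "(?<log_time>.*?)? ?(?<src>.*?)? ?(?<insert_time>.*?);(?<col1>.*?);(?<col2>.*?)"] } }
  output { stdout { codec => rubydebug } }
'

Pasting one raw transaction line on stdin shows whether the fields split as expected or whether grok tags the event with _grokparsefailure.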
And the output is this:
output {
  if [tags] {
    elasticsearch {
      hosts => ["localhost"]
      index => "%{tags}-%{+YYYY.MM.dd}"
    }
  }
}
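When the stall happens, a quick way to confirm on the ES side that documents really stop arriving (assuming the transactions-* index names produced by the pattern above and the default port 9200):

# document count should keep growing while ingestion is healthy
curl -s 'localhost:9200/transactions-*/_count?pretty'
curl -s 'localhost:9200/_cat/indices/transactions-*?v'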
A leak somewhere? Garbage-collection time is constantly increasing (jstat output; the invocation is sketched after it):
S0 S1 E O P YGC YGCT FGC FGCT GCT LGCC GCC
0.00 52.87 19.08 13.50 59.96 57919 2977.900 4916 206.461 3184.361 Allocation Failure No GC
0.00 52.87 84.52 13.50 59.96 57919 2977.900 4916 206.461 3184.361 Allocation Failure No GC
72.04 0.00 60.68 13.50 59.96 57920 2977.954 4916 206.461 3184.415 Allocation Failure No GC
0.00 65.02 64.85 13.51 59.96 57921 2978.009 4916 206.461 3184.471 Allocation Failure No GC
58.54 0.00 40.52 13.51 59.96 57922 2978.058 4916 206.461 3184.520 Allocation Failure No GC
58.54 0.00 98.01 13.51 59.96 57922 2978.058 4916 206.461 3184.520 Allocation Failure No GC
0.00 58.06 99.74 13.51 59.96 57924 2978.107 4916 206.461 3184.569 Allocation Failure
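For reference, the column layout above (S0 S1 E O P ... LGCC GCC) matches jstat -gccause on JDK 7; a sketch of the invocation, with a 1-second sampling interval and <logstash_pid> as a placeholder:

# heap-space utilisation plus the cause of the last/current GC, sampled every 1000 ms
jstat -gccause <logstash_pid> 1000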
jstack for the process: