Hey,
Configuration
Elasticsearch cluster (3 master nodes, 2 coordinating nodes, 6 data nodes) + 2 Logstash (5.2.1) nodes
Cluster status is green:
Nodes: 11
Indices: 23
Memory: 57GB / 249GB
Total Shards: 150
Unassigned Shards: 0
Documents: 3,607,973,417
Data: 5TB
Uptime: 3 days
Version: 5.2.1
Logstash filter:
Logstash itself runs with the default settings; the Bluecoat logs are parsed with the csv filter (which is where the errors below come from).
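A minimal sketch of the kind of filter in use; the column names here are illustrative, not our exact mapping:

filter {
  # Bluecoat access logs are space-delimited (W3C ELFF)
  csv {
    separator => " "
    columns => ["date", "time", "time_taken", "c_ip", "cs_method", "cs_uri"]
  }
}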
Problem
Each of the Logstash nodes starts without any problems and imports with an index rate of ~20k events/sec. There are some errors in the Logstash logs, such as:
Error parsing csv [ . . . ] :exception=>#<CSV::MalformedCSVError: Illegal quoting in line 1.>}
Received an event that has a different character encoding than you configured. [ . . . ] :expected_charset=>"UTF-8"}
We are aware of them; these are known problems with Bluecoat log files. 60k bad events out of 3.6 billion is acceptable.
At some point the index rate drops to ~300 events/sec and the CPU usage grows to 100% on all CPU cores. This happens on both Logstash nodes, but independently of each other, e.g. on Logstash node 1 it occurs after 3 hours and on Logstash node 2 after 7 hours.
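When the rate drops, the hot threads API (part of the monitoring API, which listens on port 9600 by default in 5.x) should show where the worker threads spend their time, e.g.:

# curl -XGET 'localhost:9600/_node/hot_threads?human=true&threads=3'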
# ps -ef | grep java
root 21948 21041 99 Feb19 pts/2 12-07:39:39 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -Djava.awt.headless=true -Dfile.encoding=UTF-8 -XX:+HeapDumpOnOutOfMemoryError -Xmx1g -Xms256m -Xss2048k -Djffi.boot.library.path=/usr/share/logstash/vendor/jruby/lib/jni -Xbootclasspath/a:/usr/share/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/usr/share/logstash/vendor/jruby -Djruby.lib=/usr/share/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main /usr/share/logstash/lib/bootstrap/environment.rb logstash/runner.rb --path.settings=/etc/logstash/
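Since the JVM runs with -Xmx1g, one suspicion is that the process ends up thrashing in GC; jstat can show whether the collector runs constantly while the node is slow (21948 is the Logstash PID from above, sampled every second, 10 samples):

# jstat -gcutil 21948 1000 10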
# top -Hp 21948 | head -n23
top - 11:11:25 up 27 days, 21:10, 1 user, load average: 15.93, 15.56, 14.10
Threads: 84 total, 5 running, 79 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.2 us, 0.1 sy, 0.0 ni, 92.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65864420 total, 2103748 free, 1371972 used, 62388700 buff/cache
KiB Swap: 15615996 total, 15615996 free, 0 used. 63891584 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22024 root 20 0 8244812 1.089g 18872 S 99.9 1.7 1039:01 [main]>worker11
22026 root 20 0 8244812 1.089g 18872 S 99.9 1.7 1038:29 [main]>worker13
22028 root 20 0 8244812 1.089g 18872 S 99.9 1.7 1039:25 [main]>worker15
22013 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1038:54 [main]>worker0
22014 root 20 0 8244812 1.089g 18872 R 93.8 1.7 1038:46 [main]>worker1
22015 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1039:30 [main]>worker2
22016 root 20 0 8244812 1.089g 18872 R 93.8 1.7 1039:05 [main]>worker3
22017 root 20 0 8244812 1.089g 18872 R 93.8 1.7 1039:03 [main]>worker4
22018 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1038:11 [main]>worker5
22019 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1038:59 [main]>worker6
22020 root 20 0 8244812 1.089g 18872 R 93.8 1.7 1038:29 [main]>worker7
22021 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1038:54 [main]>worker8
22022 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1039:03 [main]>worker9
22023 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1038:58 [main]>worker10
22025 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1038:42 [main]>worker12
22027 root 20 0 8244812 1.089g 18872 S 93.8 1.7 1039:16 [main]>worker14
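To see what a single hot worker is actually doing, the thread id from top can be converted to hex and looked up in a jstack dump of the Logstash JVM, e.g. for worker11 (thread id 22024):

# printf '0x%x\n' 22024
0x5608
# jstack 21948 | grep -A20 'nid=0x5608'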
Any ideas how I might find out whether there is a problem within the filter, or any other ideas? There are no errors/warnings in the Elasticsearch or Logstash logs at the point the problem occurs.
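Would comparing the per-plugin timings from the node stats API be a sensible way to check the filter? If I read the docs right, each filter reports events in/out and duration_in_millis there:

# curl -XGET 'localhost:9600/_node/stats/pipeline?pretty'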
Thanks
Andreas