I am curious whether others have hit this problem, and/or whether someone understands WHY it occurs. I had the below bit of code in a Logstash instance that consumed from one Kafka cluster, performed some filtering, and then output to another Kafka cluster. Under BAU operations this worked fine. However, if the instance was restarted with the consumer reset to the earliest offset, a condition occurred that caused CPU usage to spike and consume nearly all CPU cycles on the server. We figured out that the require 'date' statement is NOT actually needed and removed it from the configuration to resolve the problem, but we were hoping someone could shed some light on why this was occurring.
code => "
  require 'date'
  # Convert event time to epoch milliseconds
  event.set('[@metadata][current_time]', Time.now.to_datetime.strftime('%Q'))
"
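For anyone hitting the same thing: the ruby filter has an init option that runs once at pipeline startup instead of once per event, which is a natural home for the require. A hedged sketch of what that would look like (the surrounding filter block structure is assumed, not taken from the original config):

```
filter {
  ruby {
    # Runs once when the pipeline starts, not once per event
    init => "require 'date'"
    # Runs per event; 'date' is already loaded by the time this executes
    code => "event.set('[@metadata][current_time]', Time.now.to_datetime.strftime('%Q'))"
  }
}
```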
Below is what we observed while this was occurring:
- All the worker threads were constantly running
- Because the worker threads were constantly busy, the Kafka consumer threads were blocked from consuming data quickly
- We ran a number of tests with different combinations:
- Removing other parts of the filter made no difference; the only change that mattered was removing the require 'date'
- Changing the pipeline batch size and delay options made no meaningful difference
- Changing the Kafka consumer options (poll timeout, max poll records, fetch wait ms, etc.) made no meaningful difference either
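A repeated require is not free even after the library is loaded: Ruby must still resolve the feature name before it can decide the file is already in $LOADED_FEATURES, and under JRuby (which Logstash embeds) that resolution can reach the filesystem. A minimal sketch of the per-event cost, runnable under plain Ruby (timings are illustrative only, not from the original report):

```ruby
require 'benchmark'
require 'date' # first load actually reads the file from disk

ITERATIONS = 50_000

Benchmark.bm(18) do |x|
  # Each call returns false (already loaded) but still pays the feature-lookup cost
  x.report('repeated require') { ITERATIONS.times { require 'date' } }
  # Baseline: the work the filter does without the per-event require
  x.report('strftime only')    { ITERATIONS.times { Time.now.to_datetime.strftime('%Q') } }
end
```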
Below is part of a thread dump, showing the first few lines of a worker thread while the problem was occurring. In it you can see "jnr.posix.JavaSecuredFile.exists" near the top. When we took a CPU sample in Java VisualVM, that call accounted for about 70-80% of the CPU.
"Ruby-0-Thread-17@[main]>worker8: :1" - Thread t@56
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
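That native frame is consistent with require walking the load path: before the runtime can conclude a feature is loaded, it may probe candidate files on disk, and each probe is a filesystem stat of the kind shown in the dump. The following is only a rough approximation of that lookup (the real logic lives inside JRuby's load service, and the helper name here is invented for illustration):

```ruby
# Approximation of the load-path probing a `require 'date'` can trigger:
# check each $LOAD_PATH directory for a matching file. Every File.exist?
# call is a stat, the same class of call seen in the thread dump.
def probe_load_path(feature)
  $LOAD_PATH.each do |dir|
    %w[.rb .so .jar].each do |ext|
      candidate = File.join(dir, feature + ext)
      return candidate if File.exist?(candidate)
    end
  end
  nil
end

probe_load_path('date')
```

Multiply that probing by every event on every worker thread and the constant stat calls in the dump become unsurprising.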
Environment details:
- VM with 32 cores & 48GB RAM
- Logstash 6.2.2
- Kafka input plugin 8.0.4
- Kafka output plugin 6.2.2
- Ruby filter plugin 3.1.3