Stuck (~100 % CPU) pipeline workers

Hi!

We suddenly noticed that we were no longer receiving events in Kibana/Elasticsearch, and upon further investigation we saw that the server running Logstash was using 100 % CPU.

We've tried restarting both Logstash and the server, but the persistent queue has filled up, and for some reason, as soon as Logstash starts up and finishes reading the pages in the queue, the 4 workers we have running for our pipeline jump to ~100 % CPU and stop processing the events in the queue, or any other events for that matter.

How can we go about troubleshooting what's causing this, beyond knowing that it has something to do with the match filter? In other words, which actual event is it trying to process that makes it halt completely? Setting "log.level: debug" didn't provide any new insight into the cause, and neither did enabling the slow log, since nothing shows up there either.

We're a bit stuck, to be honest. :confused:

root@elk-l-001:~# curl -XGET 'localhost:9600/_node/hot_threads?human=true&threads=5'
::: {}
Hot threads at 2019-01-10T04:04:25+01:00, busiestThreads=5: 
================================================================================
82.49 % of cpu usage, state: runnable, thread name: 'Ruby-0-Thread-28', thread id: 53 
:1
    org.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:228)
    org.joni.Matcher.matchCheck(Matcher.java:304)
    org.joni.Matcher.searchInterruptible(Matcher.java:469)
--------------------------------------------------------------------------------
82.41 % of cpu usage, state: runnable, thread name: 'Ruby-0-Thread-25', thread id: 50 
:1
    org.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:228)
    org.joni.Matcher.matchCheck(Matcher.java:304)
    org.joni.Matcher.searchInterruptible(Matcher.java:469)
--------------------------------------------------------------------------------
82.25 % of cpu usage, state: runnable, thread name: 'Ruby-0-Thread-27', thread id: 52 
:1
    org.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:272)
    org.joni.Matcher.matchCheck(Matcher.java:304)
    org.joni.Matcher.searchInterruptible(Matcher.java:469)
--------------------------------------------------------------------------------
81.57 % of cpu usage, state: runnable, thread name: 'Ruby-0-Thread-26', thread id: 51 
:1
    org.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:272)
    org.joni.Matcher.matchCheck(Matcher.java:304)
    org.joni.Matcher.searchInterruptible(Matcher.java:469)
--------------------------------------------------------------------------------
0.61 % of cpu usage, state: timed_waiting, thread name: 'LogStash::Runner', thread id: 1 
    java.lang.Object.wait(Native Method)
    java.lang.Thread.join(Thread.java:1260)
    org.jruby.internal.runtime.NativeThread.join(NativeThread.java:76)
--------------------------------------------------------------------------------
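
In case the three truncated frames above are not detailed enough, a full thread dump can also be grabbed with jstack while the workers are spinning (a sketch; it assumes the JDK tools are installed and that pgrep matches only the single Logstash JVM on the host):

# Dump the full stacks of every thread in the Logstash JVM to a file.
jstack -l "$(pgrep -f org.logstash.Logstash)" > /tmp/logstash-threads.txt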

No tips on how we can troubleshoot it?

I'm now pretty sure I know exactly which grok filter is causing the problem, but we don't understand why it has become a problem all of a sudden.

What we would like to do is to be able to inspect the event that's causing all of the pipeline workers to hang.

I would imagine this might be of some interest to the Elastic team as well. It's one thing if a slow filter takes several seconds, or minutes, or whatever to chew through a particular event; but right now it simply stops working, and no events are being emitted at all because of this...

The first thing I would look at is Elasticsearch. If it is not able to receive data, things will pile up in Logstash. Is there anything in the Elasticsearch logs? Do you have plenty of disk space left so indices have not gone into read-only mode due to hitting flood stage watermarks?
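
For example, something along these lines will show per-node disk usage and any indices that have been flipped to read-only by the flood-stage watermark (assuming Elasticsearch is reachable on localhost:9200):

# Disk usage and shard counts per data node
curl -s 'localhost:9200/_cat/allocation?v'

# Indices that currently carry the flood-stage read-only block
curl -s 'localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.read_only_allow_delete&pretty'

# Once disk space has been freed up, the block can be cleared again with:
curl -s -XPUT 'localhost:9200/_all/_settings' -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'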

I have no reason to believe that this is caused by Elasticsearch, since everything looks good in Monitoring, and ES is processing data from other sources just fine (Beats -> ES Ingest).

This seems to be isolated to that particular pipeline in Logstash, combined with the persistent queue that has now built up for it, as the PQ is at its maximum of ~60 GB.

I've attached a few screenshots from Monitoring to show what I believe to be a healthy ES cluster.

Tailing the logs on all the master and data nodes does not reveal any errors since startup; there are only some GC info messages on one of the master nodes (elk-e-mas-001).

If you are getting data from Beats it is most likely something in Logstash. What does the pipeline look like? Is there anything in Logstash logs?

The pipeline is relatively complex, at almost a thousand lines, so I don't think it's relevant/helpful to paste all of it here.

The Logstash logs do not provide any helpful information, even with "log.level: debug", which is why we've been struggling to troubleshoot this.

Currently it just outputs this over and over again:

root@elk-l-001:~# tail -f /var/log/logstash/logstash-plain.log 
[2019-01-10T21:22:54,219][TRACE][logstash.agent           ] Converge results {:success=>true, :failed_actions=>[], :successful_actions=>[]}
[2019-01-10T21:22:56,554][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-01-10T21:22:56,554][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-01-10T21:22:59,121][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.
[2019-01-10T21:22:59,201][TRACE][logstash.configmanagement.elasticsearchsource] Fetch remote config pipeline {:pipeline_ids=>["main"]}
[2019-01-10T21:22:59,215][DEBUG][logstash.agent           ] Converging pipelines state {:actions_count=>0}
[2019-01-10T21:22:59,215][TRACE][logstash.agent           ] Converge results {:success=>true, :failed_actions=>[], :successful_actions=>[]}
[2019-01-10T21:23:01,558][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-01-10T21:23:01,558][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-01-10T21:23:02,813][DEBUG][logstash.inputs.metrics  ] Metrics input: received a new snapshot {:created_at=>2019-01-10 21:23:02 +0100, :snapshot=>#<LogStash::Instrument::Snapshot:0x613a8713>}
[2019-01-10T21:23:04,121][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.
[2019-01-10T21:23:04,201][TRACE][logstash.configmanagement.elasticsearchsource] Fetch remote config pipeline {:pipeline_ids=>["main"]}

However, something interesting does happen when we try to stop the Logstash service, and it's the reason we think we've found which filter is causing the problem.

When we try to stop the service, the graceful shutdown mechanisms of Logstash kick in and tell us what's preventing it from shutting down: the following message repeats over and over again, until Logstash finally times out and kills the process:

[2019-01-10T21:29:04,735][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>56, "name"=>"[main]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-tcp-5.2.0-java/lib/logstash/inputs/tcp.rb:182:in `close'"}], ["LogStash::Filters::Mutate", {"rename"=>{"app_logger"=>"[app][logger]"}, "id"=>"25dcb0fb53b861b229ac09f9f541f857eaeb33a63b67441e2228cd41f6955501"}]=>[{"thread_id"=>50, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:219:in `match'"}, {"thread_id"=>51, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:219:in `match'"}, {"thread_id"=>52, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:219:in `match'"}, {"thread_id"=>53, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:219:in `match'"}]}}

And this brings me back to my initial question: how can we find out which event it's having trouble matching against that grok filter?
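
The only idea we have come up with so far is to look directly at the page files of the persistent queue. As far as we understand, fully acknowledged pages get deleted, so the lowest-numbered page.N file left on disk should hold the oldest unprocessed events. Something along these lines (with Logstash stopped, working on a copy of the queue, and assuming the default path.data of /var/lib/logstash) at least lets us eyeball candidate events, even though the page format is binary:

systemctl stop logstash
cp -a /var/lib/logstash/queue /root/queue-copy      # never poke at the live queue
ls /root/queue-copy/main/                           # page.N files plus checkpoint files
# Replace page.0 with the lowest-numbered page present; strings makes the
# embedded message text readable even though the page itself is binary.
strings /root/queue-copy/main/page.0 | head -n 100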

As you can see in the screenshot above, we've processed billions of events, most of which went through this very same pipeline, and we have not made any changes to the pipeline for weeks.

The only change that comes to mind that was done recently to the Elastic Stack was the upgrade to 6.5.1.

But again, right after the upgrade, we still processed millions of requests for several days through this pipeline.

So my guess is that some malformed event has made its way into the queue and is now wreaking havoc on our pipeline, causing it to go into some weird loop.

If we can identify that event, my hope is that we can simply remove it from the appropriate page (or pages) in the persistent queue and everything will start working again. Hopefully we will also be able to understand how it managed to completely lock up a pipeline filter, and either mitigate it within the filters or, if it turns out to be a bug, get it fixed in grok or wherever else in Logstash it belongs.
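
On the mitigation side, the grok filter has a timeout_millis option (30000 ms by default) that is supposed to abort a runaway match and tag the event with _groktimeout, which makes us wonder why the workers spin indefinitely instead of timing out. Once we have a candidate line, the plan is to replay it against the filter in isolation, roughly like this (the %{COMBINEDAPACHELOG} pattern and the sample line are just placeholders for our real ones, and the paths assume a package install):

# Feed one suspect line through just the grok filter with a short timeout, to
# see whether it parses, times out (_groktimeout tag) or spins at 100 % CPU.
mkdir -p /tmp/ls-grok-test
echo 'PASTE_A_SUSPECT_LOG_LINE_HERE' | /usr/share/logstash/bin/logstash --path.data /tmp/ls-grok-test -e '
  input  { stdin { } }
  filter { grok { match => { "message" => "%{COMBINEDAPACHELOG}" } timeout_millis => 5000 } }
  output { stdout { codec => rubydebug } }
'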

You can drill down into your pipeline from the Monitoring view in Kibana and check the counters and details of the pipeline's inputs / filters / outputs.

If you have structured your Logstash pipeline with an id => "<something descriptive>" on each plugin, that will help.
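
The same per-plugin counters are also exposed by the Logstash monitoring API, for example (a sketch; it assumes the default API port 9600, a pipeline named "main", and that jq is installed):

# Per-plugin counters: a filter whose "in" keeps growing while "out" stays flat
# is a good candidate for where the workers are stuck.
curl -s -XGET 'localhost:9600/_node/stats/pipelines' \
  | jq '.pipelines.main.plugins.filters[] | {id, name, events}'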

Hi!

As mentioned, we're pretty sure which filter is causing the problem, but what we would like to know is which event, i.e. what actual message/text/string/whatever, is causing it to crash.

Logstash is reading all of this from the persistent queue, so it should be possible to identify which event is "at the top" of the queue, and thus the one crashing the pipeline?
