Logstash stops forwarding to elasticsearch after short run

I notice that my issue is similar to "Logstash runs as service but stops sending logs over http after some time", but I also note that there was no resolution at that time. I'm hoping someone has found a solution since then.

My 7.4.2 ELK + Filebeat stack on RHEL7 has enjoyed many months of uninterrupted service since installation. We use it to ingest system logs from over 80 servers. This morning I noticed in Kibana that there were no results past approximately 1:30am.

With no messages in any logs that I could find that might point to the reason, I restarted logstash and saw that log messages were again being processed. Unfortunately, they only continued for a short while, and only up until the current time. Over the course of the day I've repeated this process, and it's always the same thing: logs are processed for a shorter and shorter period, depending on the time between restarts.
Each time I shut down the service using "systemctl stop logstash", I see the exact same message repeated about a dozen times before the service eventually exits with failure:

[2020-07-28T15:33:01,338][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>26, "name"=>"[main]<beats", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/logstash-input-beats-6.0.3-java/lib/logstash/inputs/beats.rb:204:in `run'"}], ["LogStash::Filters::Grok", {"match"=>["syslog_message", "%{DATA:bash_user} %{DATA:bash_usertty} %{TIMESTAMP_ISO8601:bash_ttydate} \\(%{DATA:bash_usersource}\\) %{DATA:bash_cmduser} \\[%{NUMBER:bash_id}\\] %{TIMESTAMP_ISO8601:bash_date} %{DATA:bash_command} \\[%{NUMBER:bash_return}\\]", "syslog_message", "%{DATA:bash_user} %{DATA:bash_usertty} %{TIMESTAMP_ISO8601:bash_ttydate} \\(%{DATA:bash_usersource}\\) %{DATA:bash_cmduser} \\[%{NUMBER:bash_id}\\]  \\[%{NUMBER:bash_return}\\]"], "add_tag"=>["bash"], "id"=>"4227456ebf7d90cb390424c0518b027ace0edcce8fd4d5e6fb7be467f9fff883", "remove_tag"=>["_unfiltered"]}]=>[{"thread_id"=>24, "name"=>"[main]>worker0", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:182:in `match'"}, {"thread_id"=>25, "name"=>"[main]>worker1", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:182:in `match'"}]}}

The "id" listed (4227456ebf7d90cb390424c0518b027ace0edcce8fd4d5e6fb7be467f9fff883) is repeated over and over, regardless of how many times I've restarted logstash.

I experience similar behaviour when trying to stop filebeat using "systemctl stop filebeat". The service eventually exits with an error:
filebeat.service stop-sigterm timed out. Killing.

I suspect that one log file or line may be causing logstash and/or filebeat to stop forwarding and am curious to know if the "id" from the logstash shutdown has any relevance to that file/line.
Alternatively, if anyone else has experienced similar behaviour and has found a solution, I'd be keen to hear it. I have been unable to get my stack back working again thus far.

What does your config look like?

I suspect you have some log lines that do not match that grok pattern. An unanchored grok pattern that has multiple DATA fields at the beginning is going to be very, very expensive when it fails to match. By default grok will spend up to 30 seconds on each log entry before it times out. If you have a batch of 125 log entries (the default pipeline batch size) that do not match, that means it will spend over an hour on them.

The grok filter's timeout_millis option defaults to 30000 (30 seconds). You could experiment with reducing it to maybe 3000, at which point it should take no more than 10 minutes to stop logstash. Of course you will get a bunch of grok timeout errors, but it will make it clear where the issue is.
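Purely as a sketch, that could look like the following, reusing the grok filter visible in your shutdown log (match patterns copied from there; the exact values and where this sits in your pipeline config are assumptions for you to verify). Events that hit the limit get tagged, so you can search for that tag in Kibana to find the offending lines:

    filter {
      grok {
        match => {
          "syslog_message" => [
            "%{DATA:bash_user} %{DATA:bash_usertty} %{TIMESTAMP_ISO8601:bash_ttydate} \(%{DATA:bash_usersource}\) %{DATA:bash_cmduser} \[%{NUMBER:bash_id}\] %{TIMESTAMP_ISO8601:bash_date} %{DATA:bash_command} \[%{NUMBER:bash_return}\]",
            "%{DATA:bash_user} %{DATA:bash_usertty} %{TIMESTAMP_ISO8601:bash_ttydate} \(%{DATA:bash_usersource}\) %{DATA:bash_cmduser} \[%{NUMBER:bash_id}\]  \[%{NUMBER:bash_return}\]"
          ]
        }
        timeout_millis => 3000                  # down from the default 30000
        tag_on_timeout => "_groktimeout"        # this is already the default tag; search for it in Kibana
        add_tag => ["bash"]
        remove_tag => ["_unfiltered"]
      }
    }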

Then read this blog post about anchoring patterns, and consider whether some of the DATA fields could be NOTSPACE or some other pattern that is much cheaper to backtrack.
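For example, sketching against the first pattern in your filter (and assuming the bash_user, bash_usertty, bash_usersource and bash_cmduser fields never contain spaces, which you should verify against your own data), an anchored and tightened version might look like this:

    Original (unanchored, several DATA fields that backtrack expensively when a line does not match):
    %{DATA:bash_user} %{DATA:bash_usertty} %{TIMESTAMP_ISO8601:bash_ttydate} \(%{DATA:bash_usersource}\) %{DATA:bash_cmduser} \[%{NUMBER:bash_id}\] %{TIMESTAMP_ISO8601:bash_date} %{DATA:bash_command} \[%{NUMBER:bash_return}\]

    Anchored, with DATA replaced by NOTSPACE where spaces cannot occur:
    ^%{NOTSPACE:bash_user} %{NOTSPACE:bash_usertty} %{TIMESTAMP_ISO8601:bash_ttydate} \(%{NOTSPACE:bash_usersource}\) %{NOTSPACE:bash_cmduser} \[%{NUMBER:bash_id}\] %{TIMESTAMP_ISO8601:bash_date} %{DATA:bash_command} \[%{NUMBER:bash_return}\]$

bash_command stays as DATA since commands can contain spaces, but with the leading ^ and trailing $ anchors a non-matching line is rejected far more quickly.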

I'm still unsure of the cause of the extended pause in populating the current day's index, but I restarted logstash and filebeat before I left for the day and, upon looking this morning, the system is all working normally again (with yesterday's index fully populated).
Thanks for the replies.
@Badger, Thanks for the heads-up regarding the use of DATA for field matching. I will go through all of my grok filters today and replace DATA where possible. I am embarrassed to say it may have just been the easy route that saw me use DATA over a more relevant pattern match in the first place.
