Reducing stalling_threads on restart of logstash

Good afternoon,

I've been using Logstash for a while and have been trying to troubleshoot the cause of the stalling_threads_info warnings that appear when restarting the service. I've tried swapping out a variety of inputs and filters, but threads always seem to stall regardless of the configuration. Are there any ways to get more information about the cause of these in Logstash 7.2?

At one point it showed a thread stalled in a mutate filter:

[2019-07-29T14:28:35,929][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>1742, "name"=>"[main]<file", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/logstash-input-file-4.1.10/lib/filewatch/watched_files_collection.rb:51:in `sort!'"}], ["LogStash::Filters::Mutate", {"add_tag"=>["19-filter-computerLogins"], "id"=>"3102807155779a12fa498fbd00014146c0d486c1ebfa0d5635fced002ec21d8d"}]=>[{"thread_id"=>1734, "name"=>"[main]>worker0", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>1735, "name"=>"[main]>worker1", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>1736, "name"=>"[main]>worker2", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>1737, "name"=>"[main]>worker3", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}]}}

Another time it was while processing a CSV file:

[2019-07-29T14:56:38,311][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>66, "name"=>"[main]<file", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/logstash-input-file-4.1.10/lib/filewatch/stat/generic.rb:17:in `stat'"}, {"thread_id"=>70, "name"=>"[main]<jdbc", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/rufus-scheduler-3.0.9/lib/rufus/scheduler.rb:170:in `join'"}, {"thread_id"=>71, "name"=>"[main]<file", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/logstash-input-file-4.1.10/lib/filewatch/watch.rb:54:in `sleep'"}, {"thread_id"=>72, "name"=>"[main]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/logstash-input-tcp-6.0.3-java/lib/logstash/inputs/tcp.rb:152:in `run'"}], ["LogStash::Filters::CSV", {"add_field"=>["received_at", "%{@timestamp}"], "id"=>"8b55b29a05ec76c8dbf717034a38e1832d8ef13e368ce749246d1ad267f7bebe", "separator"=>",", "columns"=>["action", "username", "eventTime", "ip"]}]=>[{"thread_id"=>59, "name"=>"[main]>worker0", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>60, "name"=>"[main]>worker1", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>61, "name"=>"[main]>worker2", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>62, "name"=>"[main]>worker3", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}]}}

and most recently while processing a Cisco switch syslog entry:

[2019-07-29T14:58:39,454][WARN ][org.logstash.execution.ShutdownWatcherExt] {"inflight_count"=>0, "stalling_threads_info"=>{"other"=>[{"thread_id"=>65, "name"=>"[main]<file", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/logstash-input-file-4.1.10/lib/filewatch/watched_files_collection.rb:80:in `block in []='"}, {"thread_id"=>68, "name"=>"[main]<udp", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/logstash-input-udp-3.3.4/lib/logstash/inputs/udp.rb:121:in `select'"}], ["LogStash::Filters::Grok", {"match"=>["message", "<%{NUMBER:message_type_id}>%{BASE10NUM:event_no}: %{CISCOTIMESTAMP:cisco_time} %{WORD}: %%{NOTSPACE:cisco_type}: %{GREEDYDATA:description} %{NOTSPACE:interface}, %{GREEDYDATA:status}", "message", "<%{NUMBER:message_type_id}>%{BASE10NUM:event_no}: %{CISCOTIMESTAMP:cisco_time} %{WORD}: %%{NOTSPACE:cisco_type}: %{GREEDYDATA:description} %{NOTSPACE:interface}: %{GREEDYDATA:status}"], "add_tag"=>["cisco-switch-processed"], "id"=>"6645675d6b237fd7209905f92aafccc1b23bdd6b7703f0f1febb584441819132", "tag_on_failure"=>["cisco-switch-todo"]}]=>[{"thread_id"=>57, "name"=>"[main]>worker0", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>58, "name"=>"[main]>worker1", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>59, "name"=>"[main]>worker2", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}, {"thread_id"=>60, "name"=>"[main]>worker3", "current_call"=>"[...]/logstash-core/lib/logstash/java_pipeline.rb:239:in `block in start_workers'"}]}}

I'd love to have a more stable and reliable install, and I'm just looking for some tips on how to troubleshoot this.

Thanks!

The stalling_threads_info is a snapshot in time: it shows which threads are still processing events, and where execution is at the moment the snapshot is taken.

Because we don't want to lose events or data, Logstash will wait until all workers have managed to output their batches downstream.

I see that you have the file input thread listed as stalled. This means that it is waiting on the queue, i.e. it still has more events to add to the queue. It reads chunks of data, and a single chunk can contain thousands of events. While it is processing a chunk it does not heed the shutdown signal. You can make the chunks much smaller, which will make the file input more responsive to the shutdown signal.

file_chunk_size is the setting that controls this.

The default in bytes is 32768 (32 KB); you can set this as low as 256 with no appreciable change in throughput.
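As a sketch, a file input with a smaller chunk size would look something like this (the path is a placeholder; substitute your own):

```
input {
  file {
    path => "/var/log/example/*.log"  # hypothetical path for illustration
    file_chunk_size => 4096           # read 4 KB chunks instead of the 32 KB default
  }
}
```

A smaller chunk means the input checks for the shutdown signal more often, so restarts should stall less, at the cost of slightly more read calls.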