Logstash slows down over time

I'm having a weird problem with logstash and I'd like some help debugging it.

Basically our logstash cluster slows down over a period of a couple of weeks. By "slows down" I mean that it starts rejecting messages from filebeat. Ultimately this causes filebeat on many of our hosts to use a lot of memory, since their log entries are not getting through. Normally, given enough log messages, I can saturate the CPUs on the logstash hosts; when it slows down, CPU usage drops to less than 50%.

Restarting logstash on the hosts seems to fix it. When I restarted it, CPU usage went back up to 100% and message throughput more than doubled. Then our autoscaling policies kicked in and added a few more hosts, and the throughput with those extra hosts ended up being almost triple the original.

Let me know any details that seem relevant and I will expand on how we have things set up.

Half an hour later the throughput is 5 times what it was before I started, and there are a whole heap of hosts handling the load. So there were definitely messages available.

What version of LS and Java?
What OS?
What does your config look like?

Logstash 5.3. Java is openjdk 1.8. OS Ubuntu 16.04.

Do you mean the logstash configuration? It's pretty standard. These hosts go from a filebeat input to an SQS output. All they do is read messages and write them to SQS; there are no filters.
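
Roughly, with the queue name, region and port replaced by placeholders, the pipeline looks like this:

  input {
    beats {
      port => 5044                    # filebeat ships to this port (via a load balancer)
    }
  }

  output {
    sqs {
      queue  => "example-log-queue"   # placeholder queue name
      region => "us-east-1"           # placeholder region
    }
  }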

This is a hard question.

Take a thread dump about 2 minutes after LS starts - this will be a baseline.

Can you use the monitoring API to periodically poll the stats to see when the slowdown occurs?
Then you can:

  • look at the log files around that time to see if there are errors
  • do a JVM thread dump so we can see what LS is doing internally (see the sketch below).

zip up the thread dumps and post them here.

https://helpx.adobe.com/experience-manager/kb/TakeThreadDump.html
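
Something along these lines should do it (9600 is the default monitoring API port; the pid lookup assumes a single logstash java process on the host):

  # Poll the pipeline stats once a minute so you can see roughly when the slowdown starts.
  while true; do
    curl -s localhost:9600/_node/stats/pipeline >> pipeline-stats.log
    sleep 60
  done

  # Hot-threads view from the monitoring API.
  curl -s 'localhost:9600/_node/hot_threads?human=true' > hot-threads.txt

  # Full JVM thread dump, as described in the link above.
  jstack $(pgrep -f logstash) > threaddump-$(date +%s).txt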

Thanks I will do that. This might take a while since I have to hope that the instance I take the reference dump on doesn't get shut down due to autoscaling.

I have two thread dumps, but I am only allowed to upload images.

If this helps, we're actually connecting filebeat to logstash via a load balancer. The logs and tcpdump from the load balancer indicate that logstash is sending TCP resets after running for only a couple of minutes.

It seems like this was solved by turning off publish_async on filebeat.
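
For anyone else who hits this, the setting lives under the logstash output section of filebeat.yml (the host here is a placeholder; ours points at the load balancer):

  output.logstash:
    hosts: ["logstash.internal:5044"]
    publish_async: false              # this was previously true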

However, it still raises the question of why I see so many errors from logstash. There are a lot of "ERR Failed to publish events caused by: EOF" entries in the filebeat logs, and I also see a lot of random resets coming from logstash.

Firewall, load balancer?

I think I ruled out the load balancers, and I'm pretty sure I saw the same thing before we started using load balancers. There are no relevant firewalls, only AWS security groups.

Watching the traffic from the load balancer it looks like the connection resets are coming from the logstash hosts.

The other strange thing is that there are a lot of errors for about half a minute when I start filebeat. After that it seems OK for a while.

Even with publish_async: false there is still a problem. Filebeat is no longer taking up 1.5G of memory on our hosts, but logstash still slows down after a couple of weeks. I noticed that filebeat was several log rotations behind on several hosts, and restarting logstash fixed it.

Any suggestions on how to upload those stack dumps?

GitHub gist, pastebin, etc.

I work with @jbeck and can provide some additional information. I don't know if it's helpful or not, but here are the last 100 lines from the Logstash log before I restarted the service:

Finally after a couple of weeks this has happened again and I was able to take a stack trace.

Stack dump from starting: https://ptpb.pw/nHyu
Stack dump after slowing down: https://ptpb.pw/kaxQ

@jbeck @joshuaspence
I have an idea what is going on.

10 of your 12 worker threads are stuck in the http output. There is a delayed retry mechanism which is executed via a scheduled timer task. The retry delay grows as a function of the attempt number, with any attempt above 7 sleeping for between 30 and 60 seconds.
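
The backoff calculation is roughly of this shape (a sketch of the idea rather than a verbatim copy of the plugin source):

  # The delay grows with the square of the attempt number, is capped at 60
  # seconds, and is randomised, so attempt 8 and above sleeps for somewhere
  # between 30 and 60 seconds.
  def sleep_for_attempt(attempt)
    sleep_for = attempt**2
    sleep_for = sleep_for <= 60 ? sleep_for : 60
    (sleep_for / 2) + (rand(0..sleep_for) / 2)
  end

While a worker is parked in one of those sleeps it is not taking new batches off the internal queue, which is how a slow or failing http endpoint ends up backing up the whole pipeline.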

The plugin defines these http response codes as retryable:

config :retryable_codes, :validate => :number, :list => true, :default => [429, 500, 502, 503, 504]

Any response code in the 200-299 range will be treated as a success.

There are some Manticore (the http client we use) Exceptions which are considered retryable:

  RETRYABLE_MANTICORE_EXCEPTIONS = [
    ::Manticore::Timeout,
    ::Manticore::SocketException,
    ::Manticore::ClientProtocolException, 
    ::Manticore::ResolutionFailure, 
    ::Manticore::SocketTimeout
  ]

If the Manticore Client received a response but the code is not considered a successful one then you should be seeing errors logged that look like this:

"Encountered non-2xx HTTP code #{response.code}",
            :response_code => response.code,
            :url => url,
            :event => event,
            :will_retry => will_retry

If the Manticore Client failed to get a response then you should be seeing errors logged that look like this:

"Could not fetch URL",
                    :url => url,
                    :method => @http_method,
                    :body => body,
                    :headers => headers,
                    :message => exception.message,
                    :class => exception.class.name,
                    :backtrace => exception.backtrace,
                    :will_retry => will_retry

Please look at your logs to see what is happening with your http endpoint such that it causes retries.

Nope. The only thing in the logs is a whole lot of these:

{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494302705284,"thread":"[main]>worker11","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648335}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494302797224,"thread":"[main]>worker9","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648251}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494302886711,"thread":"[main]>worker11","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626559}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303062900,"thread":"[main]>worker0","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626455}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303190564,"thread":"[main]>worker10","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648271}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303390010,"thread":"[main]>worker0","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648209}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303399552,"thread":"[main]>worker11","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648265}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303453774,"thread":"[main]>worker0","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626547}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303476140,"thread":"[main]>worker9","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626547}}

Have you checked the older logs? If you are getting many SQS warnings, the log file may have rolled over.

What endpoint is your http output communicating with?

Logstash ships its own logs into our elasticsearch rig, and I can't find anything relevant in there. The HTTP output is being used as a consul health check: we're sending a single heartbeat event every 10 seconds, and the response time should be milliseconds.
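
For context, the heartbeat side of the config is roughly this (assuming the heartbeat input generates the event; the endpoint URL is a placeholder):

  input {
    heartbeat {
      interval => 10                                   # one heartbeat event every 10 seconds
    }
  }

  output {
    http {
      url         => "http://consul.example/health"    # placeholder for the consul check endpoint
      http_method => "post"
    }
  }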

We'll disable the HTTP output for now and see if it helps anything.

We have disabled the HTTP output and (so far) haven't been having any further issues. Should this be filed as a bug on https://github.com/logstash-plugins/logstash-output-http?

You could create an issue against the http output plugin.

Remember to state that your use case is one where the http output is secondary. If you suggest that the retry mechanism should be optional, you need to explain what you think the output should do with the events in that case: a) drop them, or b) put them somewhere else, e.g. a Dead Letter Queue.