Logstash slows down over time

I'm having a weird problem with logstash and I'd like some help debugging it.

Basically our logstash cluster slows down over a period of a couple of weeks. By "slows down" I mean that it starts rejecting messages from filebeat. Ultimately this causes filebeat on many of our hosts to use a lot of memory, since their log entries are not getting through. Normally, given enough log messages, I can saturate the CPUs on the logstash hosts; when it slows down, CPU usage drops to less than 50%.

Restarting logstash on the hosts seems to fix it. When I restarted it, CPU usage went back up to 100% and message throughput more than doubled. Then our autoscaling policies kicked in and added a few more hosts, and the throughput with those extra hosts ended up being almost triple the original.

Let me know any details that seem relevant and I will expand on how we have things set up.

Half an hour later the throughput is 5 times what it was before I started, and there are a whole heap of hosts handling the load. So there were definitely messages available.

What version of LS and Java?
What OS?
What does your config look like?

Logstash 5.3. Java is openjdk 1.8. OS Ubuntu 16.04.

Do you mean the logstash configuration? It's pretty standard. These hosts go from a filebeat input to an SQS output. All they do is read messages and write them to SQS; there are no filters.
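
Roughly, with the queue name, region and port replaced by placeholders, the pipeline looks like this:

  input {
    beats {
      port => 5044                    # filebeat ships to this port (via a load balancer)
    }
  }

  output {
    sqs {
      queue  => "example-log-queue"   # placeholder queue name
      region => "us-east-1"           # placeholder region
    }
  }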

This is a hard question.

Take a thread dump about 2 minutes after LS starts - this will be a baseline.

Can you use the monitoring API to periodically poll the stats to see when the slowdown occurs?
Then you can:

  • look at the log files around that time to see if there are errors
  • do a JVM thread dump so we can see what LS is doing internally (see the sketch below).

zip up the thread dumps and post them here.

https://helpx.adobe.com/experience-manager/kb/TakeThreadDump.html
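
Something along these lines should do it (9600 is the default monitoring API port; the pid lookup assumes a single logstash java process on the host):

  # Poll the pipeline stats once a minute so you can see roughly when the slowdown starts.
  while true; do
    curl -s localhost:9600/_node/stats/pipeline >> pipeline-stats.log
    sleep 60
  done

  # Hot-threads view from the monitoring API.
  curl -s 'localhost:9600/_node/hot_threads?human=true' > hot-threads.txt

  # Full JVM thread dump, as described in the link above.
  jstack $(pgrep -f logstash) > threaddump-$(date +%s).txt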

Thanks I will do that. This might take a while since I have to hope that the instance I take the reference dump on doesn't get shut down due to autoscaling.

I have two thread dumps, but I am only allowed to upload images.

If this helps, we're actually connecting filebeat to logstash via a load balancer. The logs and tcpdump from the load balancer indicate that logstash is sending TCP resets after running for only a couple of minutes.

It seems like this was solved by turning off publish_async on filebeat.
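
For anyone else who hits this, the setting lives under the logstash output section of filebeat.yml (the host here is a placeholder; ours points at the load balancer):

  output.logstash:
    hosts: ["logstash.internal:5044"]
    publish_async: false              # this was previously true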

However, it still raises the question of why I see so many errors from logstash. There are a lot of "ERR Failed to publish events caused by: EOF" entries in the filebeat logs, and I also see a lot of random resets coming from logstash.

Firewall, load balancer?

I think I ruled out the load balancers, and I'm pretty sure I saw the same thing before we started using load balancers. There are no relevant firewalls, only AWS security groups.

Watching the traffic from the load balancer it looks like the connection resets are coming from the logstash hosts.

The other strange thing is that there are a lot of errors for about half a minute when I start filebeat. After that it seems OK for a while.

Even with publish_async: false there is still a problem. Filebeat is no longer taking up 1.5G of memory on our hosts, but logstash still slows down after a couple of weeks. I noticed that filebeat was several log rotations behind on several hosts, and restarting logstash fixed it.

Any suggestions on how to upload those stack dumps?

GitHub gist, pastebin, etc.

I work with @jbeck and can provide some additional information. I don't know if it's helpful or not, but here are the last 100 lines from the Logstash log before I restarted the service:

Finally after a couple of weeks this has happened again and I was able to take a stack trace.

Stack dump from starting: https://ptpb.pw/nHyu
Stack dump after slowing down: https://ptpb.pw/kaxQ

@jbeck @joshuaspence
I have an idea what is going on.

10 of your 12 worker threads are stuck in the http output. There is a delayed retry mechanism which is executed via a scheduled timer task. The retry delay grows as a function of the attempt number, with any attempt above 7 sleeping for between 30 and 60 seconds.
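
The backoff calculation is roughly of this shape (a sketch of the idea rather than a verbatim copy of the plugin source):

  # The delay grows with the square of the attempt number, is capped at 60
  # seconds, and is randomised, so attempt 8 and above sleeps for somewhere
  # between 30 and 60 seconds.
  def sleep_for_attempt(attempt)
    sleep_for = attempt**2
    sleep_for = sleep_for <= 60 ? sleep_for : 60
    (sleep_for / 2) + (rand(0..sleep_for) / 2)
  end

While a worker is parked in one of those sleeps it is not taking new batches off the internal queue, which is how a slow or failing http endpoint ends up backing up the whole pipeline.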

The plugin defines these http response codes as retryable:

config :retryable_codes, :validate => :number, :list => true, :default => [429, 500, 502, 503, 504]

Any response code in the 200-299 range will be treated as a success.

There are some Manticore (the http client we use) Exceptions which are considered retryable:

  RETRYABLE_MANTICORE_EXCEPTIONS = [
    ::Manticore::Timeout,
    ::Manticore::SocketException,
    ::Manticore::ClientProtocolException, 
    ::Manticore::ResolutionFailure, 
    ::Manticore::SocketTimeout
  ]

If the Manticore Client received a response but the code is not considered a successful one then you should be seeing errors logged that look like this:

"Encountered non-2xx HTTP code #{response.code}",
            :response_code => response.code,
            :url => url,
            :event => event,
            :will_retry => will_retry

If the Manticore Client failed to get a response then you should be seeing errors logged that look like this:

"Could not fetch URL",
                    :url => url,
                    :method => @http_method,
                    :body => body,
                    :headers => headers,
                    :message => exception.message,
                    :class => exception.class.name,
                    :backtrace => exception.backtrace,
                    :will_retry => will_retry

Please look at your logs to see what is happening with your http endpoint such that it causes retries.

Nope. The only thing in the logs is a whole lot of these:

{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494302705284,"thread":"[main]>worker11","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648335}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494302797224,"thread":"[main]>worker9","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648251}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494302886711,"thread":"[main]>worker11","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626559}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303062900,"thread":"[main]>worker0","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626455}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303190564,"thread":"[main]>worker10","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648271}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303390010,"thread":"[main]>worker0","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648209}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303399552,"thread":"[main]>worker11","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2648265}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303453774,"thread":"[main]>worker0","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626547}}
{"level":"WARN","loggerName":"logstash.outputs.sqs","timeMillis":1494303476140,"thread":"[main]>worker9","logEvent":{"message":"Message exceeds maximum length and will be dropped","message_size":2626547}}

Have you checked the older logs? If you are getting many SQS warnings, the log file may have rolled over.

What endpoint is your http output communicating with?

Logstash ships its own logs into our elasticsearch rig, and I can't find anything relevant in there. The HTTP output is being used as a consul health check: we're sending a single heartbeat event every 10 seconds, and the response time should be milliseconds.
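
For context, the heartbeat side of the config is roughly this (assuming the heartbeat input generates the event; the endpoint URL is a placeholder):

  input {
    heartbeat {
      interval => 10                                   # one heartbeat event every 10 seconds
    }
  }

  output {
    http {
      url         => "http://consul.example/health"    # placeholder for the consul check endpoint
      http_method => "post"
    }
  }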

We'll disable the HTTP output for now and see if it helps anything.

We have disabled the HTTP output and (so far) haven't been having any further issues. Should this be filed as a bug on https://github.com/logstash-plugins/logstash-output-http?

You could create an issue against the http output plugin.

Remember to state that your use case is one where the http output is secondary. If you suggest that the retry mechanism should be optional, you need to explain what you think the output should do with the events in that case: a) drop them, or b) put them somewhere else, e.g. a Dead Letter Queue.