Logstash is overloaded?

Hey all,

I've got an ELK stack for development, and all was well up until recently. I'm not quite sure what's going on, but I'm pretty sure Logstash is getting creamed and can't keep up.

The basics: an 8-node Elasticsearch 6.2.4 cluster, a 2-node Logstash cluster, and a single Kibana host.

When I let Logstash run I see no entries in Kibana for the Filebeat indexes. All the syslog ones are fine.

If I bounce Logstash, the Filebeat indexes start showing data in the histogram, then peter out after a few minutes, and I start seeing this in all the Filebeat logs:

    2018-06-08T08:20:14.792-0700  ERROR  logstash/async.go:235  Failed to publish events caused by: read tcp 172.x.x.251:56820->172.x.x.246:5044: i/o timeout
    2018-06-08T08:20:14.843-0700  ERROR  logstash/async.go:235  Failed to publish events caused by: client is not connected
    2018-06-08T08:20:14.887-0700  ERROR  logstash/async.go:235  Failed to publish events caused by: client is not connected
    2018-06-08T08:20:14.953-0700  ERROR  logstash/async.go:235  Failed to publish events caused by: client is not connected
    2018-06-08T08:20:15.844-0700  ERROR  pipeline/output.go:92  Failed to publish events: client is not connected
    2018-06-08T08:20:15.887-0700  ERROR  pipeline/output.go:92  Failed to publish events: client is not connected
    2018-06-08T08:20:15.953-0700  ERROR  pipeline/output.go:92  Failed to publish events: client is not connected

I am able to telnet from the same host to that port, so it's not connectivity. I've also played with Filebeat and tried changing bulk_max_size up and down, to no avail.

When I tail the Logstash logs, I can see the events coming in.

I guess my question is: how do I tune Logstash? I increased the Java heap and set pipeline settings in Logstash, but I'm seeing no difference. I'm monitoring the in/out events and heap usage, but I don't know what's healthy and what isn't.
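For reference, these are the sorts of pipeline settings I've been poking at in logstash.yml (the values below are only illustrative, not a recommendation for this workload):

    # logstash.yml -- pipeline tuning knobs (illustrative values)
    pipeline.workers: 8        # defaults to the number of CPU cores
    pipeline.batch.size: 250   # events each worker pulls from the queue per batch
    pipeline.batch.delay: 50   # ms to wait before flushing an under-filled batch

The heap itself lives in config/jvm.options as -Xms/-Xmx.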

Each Logstash server shows the in and out events just going up and up, starting around 1 million and after a while getting upwards of 10 million.

My heap usage fluctuates anywhere between 25% and 80%.

I'm not entirely sure what I need to do to relieve the pressure, other than scale the LS cluster horizontally, but I'd rather understand how to tune and troubleshoot the service properly.

Let me know where you think I should start!

Thanks!

Soooo, right after I posted this I changed the Filebeat bulk_max_size to 4096 just for giggles, and I can see the data coming in without stopping. I guess this could have been a Filebeat thing, then? That said, I'll keep an eye on it and see if it peters out again.
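In case it helps anyone else, the setting sits under the Logstash output in filebeat.yml; a minimal sketch (the host here is just the placeholder address from the error log above):

    # filebeat.yml -- Logstash output (illustrative)
    output.logstash:
      hosts: ["172.x.x.246:5044"]   # placeholder host from the logs above
      bulk_max_size: 4096           # max events per bulk request to Logstash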

Thanks!

What versions are you using for Filebeat and Logstash?

I'd make sure you have the latest version of the beats input plugin installed in Logstash. (update instructions) (changelog)
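Something along these lines, from the Logstash home directory (paths will differ for the Docker image and package installs):

    bin/logstash-plugin list --verbose logstash-input-beats   # check the installed version
    bin/logstash-plugin update logstash-input-beats           # update to the latest release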

Thanks Andrew. I'll check that out.

It does appear that data is still coming in, so that change seems to have resolved the issue.

IIRC I can only use the plugin framework if I've compiled from source? I use Docker/repo packages.

If I can, great, I'll definitely update. I'll check it out either way.

Also, I have a quick question about the API stats. Are the events in/out from the pipeline stats cumulative?

"pipeline": {
    "events": {
        "duration_in_millis": 93726814,
        "filtered": 26448856,
        "in": 26448856,
        "out": 26448856,
        "queue_push_duration_in_millis": 9758765

If so, I should probably have my monitoring use this as a delta, since I'm not getting a good idea of in/out per minute.
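As a rough sketch of what I mean by a delta, assuming the node stats API on the default port 9600 and the same field path as the snippet above (the URL is an assumption; adjust it for your Logstash version, e.g. newer releases nest this under pipelines.<id>):

    import json
    import time
    import urllib.request

    # Hypothetical URL; point this at your Logstash monitoring API.
    STATS_URL = "http://localhost:9600/_node/stats/pipeline"


    def fetch_event_counters(url=STATS_URL):
        """Return the cumulative in/out event counters from the node stats API."""
        with urllib.request.urlopen(url) as resp:
            stats = json.load(resp)
        events = stats["pipeline"]["events"]
        return events["in"], events["out"]


    def watch(interval_seconds=60):
        """Print events in/out per interval by taking deltas of the cumulative counters."""
        prev_in, prev_out = fetch_event_counters()
        while True:
            time.sleep(interval_seconds)
            cur_in, cur_out = fetch_event_counters()
            print(f"in: {cur_in - prev_in}/interval, out: {cur_out - prev_out}/interval")
            prev_in, prev_out = cur_in, cur_out


    if __name__ == "__main__":
        watch()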

[Screenshot: Logstash events graph]

You can use LS plugins with any release. Internally, all of the standard inputs/filters/outputs are plugins themselves; they're just bundled with LS by default.

I believe those are cumulative, so you'll need a derivative to get rates. I'm pretty sure the X-Pack monitoring UI for Logstash uses a derivative aggregation to display these values.
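If you end up querying indexed counters yourself, a date_histogram with a derivative pipeline aggregation gives the rate. A rough sketch (the index and field names here are placeholders; substitute wherever and however you store the counters):

    POST logstash-stats-*/_search
    {
      "size": 0,
      "aggs": {
        "per_minute": {
          "date_histogram": { "field": "@timestamp", "interval": "1m" },
          "aggs": {
            "events_in": { "max": { "field": "events.in" } },
            "events_in_per_minute": { "derivative": { "buckets_path": "events_in" } }
          }
        }
      }
    }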

BTW, the monitoring feature is included in the X-Pack Basic license, which is free, and it includes some of these metrics plus the pipeline viewer.
