I've got an ELK stack for development and all was well, up until recently. Not quite sure what's going on, but I'm pretty sure it's Logstash getting creamed and not able to keep up.
The basics: I have an 8-node Elasticsearch 6.2.4 cluster, a 2-node Logstash cluster, and a single Kibana host.
When I let Logstash run I see no entries in Kibana for the Filebeat indexes. All the syslog ones are fine.
If I bounce Logstash, the Filebeat indexes start showing data in the histogram, then peter out after a few minutes, and I start seeing this in all the Filebeat logs:
| Timestamp | Level | Source | Message |
|---|---|---|---|
| 2018-06-08T08:20:14.792-0700 | ERROR | logstash/async.go:235 | Failed to publish events caused by: read tcp 172.x.x.251:56820->172.x.x.246:5044: i/o timeout |
| 2018-06-08T08:20:14.843-0700 | ERROR | logstash/async.go:235 | Failed to publish events caused by: client is not connected |
| 2018-06-08T08:20:14.887-0700 | ERROR | logstash/async.go:235 | Failed to publish events caused by: client is not connected |
| 2018-06-08T08:20:14.953-0700 | ERROR | logstash/async.go:235 | Failed to publish events caused by: client is not connected |
| 2018-06-08T08:20:15.844-0700 | ERROR | pipeline/output.go:92 | Failed to publish events: client is not connected |
| 2018-06-08T08:20:15.887-0700 | ERROR | pipeline/output.go:92 | Failed to publish events: client is not connected |
| 2018-06-08T08:20:15.953-0700 | ERROR | pipeline/output.go:92 | Failed to publish events: client is not connected |
I am able to telnet from the same host to that port, so it's not basic connectivity. I've also played with Filebeat and tried changing bulk_max_size up and down, to no avail.
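The telnet check can be scripted so it is easy to repeat from every Filebeat host. This is only a minimal sketch: `can_connect` is a hypothetical helper name, and the host in the usage note is a placeholder. It proves nothing more than that the port accepts TCP connections. The `i/o timeout` in the log above is reported on a read, which happens after the connection is already open, so a passing check does not show whether Logstash is answering on that connection.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Equivalent to a successful `telnet host port`; it does not send or
    read any data, so it cannot detect a slow or stalled Logstash input.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `can_connect("logstash-host", 5044)` would check the Beats port used in the log lines, with `logstash-host` standing in for the real address.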
When I tail the Logstash logs, I can see the events coming in.
I guess my question is: how do I tune Logstash? I increased the Java heap and adjusted the pipeline settings in Logstash, but I'm seeing no difference. I'm monitoring the in/out events and heap usage, but I don't know what's healthy and what isn't.
Each Logstash server shows the in and out event counts just going up and up, starting around 1 million and after a while getting upwards of 10 million.
My heap usage fluctuates anywhere between 25% and 80%, going up and down.
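The in/out counts are cumulative counters, so totals that keep climbing are expected on their own. The change between two samples is usually more telling. Below is a rough sketch of sampling those numbers, assuming the Logstash monitoring API is reachable on its default port 9600 and that `/_node/stats` exposes `events.in`, `events.out`, `events.duration_in_millis` and `jvm.mem.heap_used_percent`. The function names and the URL host are placeholders, not part of my setup.

```python
import json
import time
from urllib.request import urlopen

# Assumption: monitoring API on the default port; replace the host as needed.
STATS_URL = "http://localhost:9600/_node/stats"


def fetch_stats(url=STATS_URL, timeout=5):
    """Fetch the node stats document and return it as a dict."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(prev, curr, seconds):
    """Turn two cumulative snapshots taken `seconds` apart into interval figures."""
    ev_prev, ev_curr = prev["events"], curr["events"]
    d_in = ev_curr["in"] - ev_prev["in"]
    d_out = ev_curr["out"] - ev_prev["out"]
    d_ms = ev_curr["duration_in_millis"] - ev_prev["duration_in_millis"]
    return {
        "in_per_sec": d_in / seconds,
        "out_per_sec": d_out / seconds,
        # Rough gap between events received and events emitted so far.
        "in_flight": ev_curr["in"] - ev_curr["out"],
        # Processing time reported per emitted event in this interval.
        "ms_per_event": d_ms / d_out if d_out else None,
        "heap_used_percent": curr["jvm"]["mem"]["heap_used_percent"],
    }


def watch(interval=10):
    """Print interval figures every `interval` seconds until interrupted."""
    prev = fetch_stats()
    while True:
        time.sleep(interval)
        curr = fetch_stats()
        print(summarize(prev, curr, interval))
        prev = curr
```

Calling `watch()` on each Logstash node prints one line per interval. An out rate that stays well below the in rate, or a steadily rising `ms_per_event`, would point at the filter or output stages rather than at the Beats input.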
I'm not entirely sure what I need to do to relieve the pressure, other than scale the Logstash cluster horizontally, but I'd rather understand how to tune and troubleshoot the service properly.
Let me know where you think I should start!