For a while now, one of our Logstash instances has been periodically stalling, at which point it stops sending events through our outputs.
It gets into a weird state where it seems to be running just fine (`sudo service logstash status` returns `logstash is running`) and the log file doesn't show anything special. The only way we know something is wrong is that we stop seeing events being processed. This includes the heartbeat we have set up.
It's also odd that when it gets like this and we run `sudo service logstash stop`, it fails to stop within 10 seconds (i.e. it prints `logstash stop failed; still running.`). But if we check the logs at this point, we see an error printed every few seconds. Here's a pastebin of one of the error messages. The only way we can get it to stop is by running `sudo service logstash force-stop`, but I'm pretty sure that means we're losing some events every time we do that.
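To make the sequence concrete, here is a rough sketch of what we end up doing by hand each time, written as a shell function. The function name `recover_logstash` and the `SERVICE_CMD` variable are made up for illustration, and the sketch assumes the init script exits non-zero when the stop fails, which I have not verified.

```shell
#!/bin/sh
# Sketch of the manual sequence described above: status, stop, force-stop.
# SERVICE_CMD is a hypothetical override so the sequence can be dry-run
# against a stub; by default it is the real command from this post.
: "${SERVICE_CMD:=sudo service logstash}"

# Hypothetical helper name; not part of Logstash or its init script.
recover_logstash() {
    # The init script reports "logstash is running" even when it is stalled.
    $SERVICE_CMD status

    # Try the graceful stop first.
    # Assumption: the init script exits non-zero when it prints
    # "logstash stop failed; still running."
    if $SERVICE_CMD stop; then
        echo "stopped cleanly"
        return 0
    fi

    # Graceful stop failed, so fall back to the forced stop.
    # This is the step where we suspect in-flight events are lost.
    echo "graceful stop failed; forcing" >&2
    $SERVICE_CMD force-stop
}
```

Calling `recover_logstash` with the default `SERVICE_CMD` runs the three commands in the order we run them manually; the forced stop only happens if the graceful one reports failure.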
I'm really just wondering what is happening to get Logstash into this state, and what I can do to debug it further (even if it's just to get error logs). I thought Logstash was able to handle errors from the outputs and keep running. Is there something that we're doing wrong? The only outputs we're using are
redis (though this is for certain failure cases, and I don't think this output is being used often, if at all).
We have Logstash v2.3.4 running on an EC2 c4.xlarge (4 vCPU, 7.5GB Memory) with
Thanks in advance!