We were running Logstash 5.6.1 and noticed it had stopped listening on its input IP, so we upgraded to Logstash 5.6.4.
And here I am, a few days later: Logstash has stopped listening again, and clients cannot connect to push log messages.
The Logstash log gets this entry every few seconds:
[2017-11-17T13:57:44,949][ERROR][logstash.outputs.elasticsearch] Encountered a retryable error. Will Retry with exponential backoff {:code=>400, :url=>"http://<IP of ES endpoint>:9200/_bulk"}
The Logstash daemon is running and the monitoring endpoint is listening (I can run commands against port 9600 fine), but our profile endpoint is not accepting new connections (clients fail to connect to the socket it should be listening on).
I end up having to stop and start the Logstash daemon to get it going again (which is clearly not great!). Note that I only cycle Logstash, so the problem does not appear to be in ES.
What does this mean?
What do I do next?
The output from a bunch of monitoring API commands, taken while Logstash was in the problem state, is here: https://pastebin.com/Aysn229G
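For reference, these are node stats / hot threads style calls against the default monitoring port (localhost:9600 here; the two commands below are illustrative, not necessarily the full set in the pastebin):

curl -s 'http://localhost:9600/_node/stats?pretty'
curl -s 'http://localhost:9600/_node/hot_threads?human=true'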
The 400 response means ES is rejecting documents, typically because they are malformed or incompatible with the existing mapping. The Logstash log should contain additional information about this. Perhaps the dead letter queue feature could help you capture the rejected documents?
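To sketch what that would look like (the paths below are examples, adjust them to your install; the DLQ only captures events for which the elasticsearch output received a 400/404):

# logstash.yml
dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue

# a separate pipeline to read the captured events back out
input {
  dead_letter_queue {
    path => "/var/lib/logstash/dead_letter_queue"
    commit_offsets => true
  }
}
output {
  stdout { codec => rubydebug { metadata => true } }
}

The reason each event was rejected is stored in its @metadata, which is why the rubydebug codec is told to print metadata.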
The reason I have not set up the DLQ yet is that the symptom clearly points to Logstash itself having failed. Consider:
It is not that some messages are failing to reach ES; when this condition arises, zero messages are forwarded to ES. (Logstash receives 10-1000 messages per second over TCP from eight to ten app servers.)
The entire situation is resolved by cycling the Logstash daemon; ES is never cycled. Cycling Logstash instantly changes the system from forwarding zero messages to forwarding all messages, which indicates that the Logstash process itself is in a broken state.
I will now set up the DLQ, but from the evidence there is no reason to believe it will give useful information. (Also, other posters with the same issue report nothing in their DLQs.)
We can say authoritatively that the problem is NOT ES-related; it is a pure Logstash failure.
This thread reflects where we have taken this so far (the other poster is not us; it is another user with the same problem). Only today did we get another failure and more log data, including:
[io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
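If the root cause really is the descriptor limit, the usual fix (a sketch, assuming a package install managed by systemd; the limit value is only an example) is to check how many descriptors the process is holding and raise LimitNOFILE for the service:

# how many file descriptors is the Logstash process holding right now?
ls /proc/$(pgrep -f logstash | head -1)/fd | wc -l

# /etc/systemd/system/logstash.service.d/override.conf
[Service]
LimitNOFILE=65536

sudo systemctl daemon-reload
sudo systemctl restart logstash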