Hi guys,
We've recently run into a worrying situation on our semi-PROD server. We have a typical setup: Logstash tails our logs, parses and pre-processes them, and then sends them to a self-hosted Elasticsearch server. We shut down Elasticsearch for a brief period and then started it again. At that moment Logstash began aggressively consuming resources on the semi-PROD server. From a brief investigation, it seems it was parsing/sending all the log lines that had been written while Elasticsearch was down (and which Logstash had therefore failed to send), which made Logstash compete for resources with the semi-PROD services running on the same server.
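For context, a minimal sketch of the kind of pipeline we run (paths and the grok pattern below are simplified stand-ins, not our real config):

```
input {
  file {
    path => "/var/log/app/*.log"   # illustrative path
  }
}

filter {
  grok {
    # simplified stand-in for our real parsing/pre-processing
    match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]   # our self-hosted instance
  }
}
```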
What bothers me is that it did not look like Logstash was merely sending pre-buffered events: I would expect the send step to be extremely cheap, but CPU usage was ridiculously high for a sustained period of time. In other words, Logstash seemed to be not just sending pre-buffered events but also reading, parsing, and processing the logs. So, do the input/filter stages halt (instead of continuing to work and appending intermediate events to some internal queue) when the output is "clogged"?
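For reference, here is how we understand the relevant logstash.yml defaults (the values shown are the documented defaults as far as we know; our actual settings may differ):

```
# logstash.yml (defaults as we understand them)
queue.type: memory        # default: small fixed in-flight buffer, no spooling to disk
pipeline.workers: 2       # defaults to the number of CPU cores; 2 is just an example
pipeline.batch.size: 125  # events each worker collects per batch
```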
Is there any way to simply skip re-parsing/re-sending those "unsent" events when our Elasticsearch or the network connection goes down?
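(The closest thing we've found is the persistent queue, but if we read the docs correctly it buffers events for later delivery rather than skipping them; the sketch below assumes a Logstash version that supports queue.type: persisted.)

```
# logstash.yml — persistent queue (buffers events, does not skip them)
queue.type: persisted
queue.max_bytes: 1gb                   # cap on the on-disk backlog; inputs get backpressured, events aren't dropped
path.queue: /var/lib/logstash/queue    # illustrative path
```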
JFYI: we've already reduced Logstash's process priority and lowered --pipeline-workers.
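Concretely, something like this (exact values approximate):

```
# lower the CPU priority of the running Logstash process
sudo renice +10 -p $(pgrep -f logstash)

# and start Logstash with fewer worker threads
bin/logstash -f pipeline.conf --pipeline-workers 1
```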
Thanks,
-Andrey