We have a log streaming pipeline set up as filebeat -> logstash -> Elasticsearch. Recently, our ES cluster started returning 429 errors for all traffic, and eventually logs stopped streaming from filebeat to Logstash. Our understanding is that the ES errors caused back-pressure in Logstash, which eventually propagated to filebeat and caused it to stall.

Investigating the filebeat logs, we found that for the entire period ES was rejecting traffic, the filebeat harvester open-file count gradually increased to 48 (normally it sits at 4-5 when everything is healthy) and stayed at that high value. We also saw that for the entire duration (almost 2 days) the libbeat.pipeline.events.active count was pinned at exactly 4117. Can someone please explain why the open file count climbed so high, and why libbeat.pipeline.events.active was fixed at 4117?

Filebeat runs on the same node as our application. The application rotates its log files by size: each file is rotated at 10MB, and we keep 10 rotated files. We generate a lot of logs, so files rotate very frequently.
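For reference, we have not knowingly overridden any of the harvester lifecycle settings, so I believe the input is effectively running with the documented defaults. The sketch below just spells those defaults out (the path is a placeholder, not our real one):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/app.log*   # placeholder; our pattern also matches the rotated files
    # Documented defaults for the harvester lifecycle (not custom values):
    scan_frequency: 10s      # how often new/rotated files are picked up
    close_inactive: 5m       # close the handle after 5m with no new lines read
    close_renamed: false     # keep harvesting a file after rotation renames it
    close_removed: true      # close the handle once the file is deleted
    harvester_limit: 0       # 0 = no cap on concurrently open harvesters
    clean_removed: true      # drop registry state for files that disappeared
```

In particular, I wonder whether harvester_limit: 0 combined with blocked publishing is what allows the open-file count to climb while rotation keeps creating new files.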
Filebeat metrics during the error period:
Filebeat metrics when there was no error in ES:
Relevant input section in the filebeat config: