Filebeat harvester open file count increasing when Logstash output to Elasticsearch fails

We have a log streaming pipeline set up as filebeat -> logstash -> Elasticsearch. Recently, our ES cluster started returning 429 errors for all traffic, and we observed that logs eventually stopped streaming from Filebeat to Logstash. Our understanding is that the ES errors caused back-pressure in Logstash, which eventually propagated to Filebeat and caused it to stall.

Further investigation of the Filebeat logs suggests that during the entire period ES was failing, the harvester open file count gradually increased to 48 (normally it is 4-5 when everything is working fine) and stayed at that high value. We also saw that for the entire duration (almost 2 days) the libbeat.pipeline.events.active count was fixed at 4117. Can someone please explain why the open file count increased to such a high value, and why libbeat.pipeline.events.active stayed fixed at 4117?

The Filebeat client runs on the same node as our application. The application uses a size-based file rotation policy: files rotate at 10 MB and we keep 10 rotated files. We generate a lot of logs, so files rotate very frequently.

Filebeat metrics during the error period:

Filebeat metrics when there was no error in ES:

Relevant input section in the Filebeat config:

Unfortunately, at the moment the way Filebeat handles back-pressure is not ideal. It keeps every file it cannot forward to the output open until the output becomes available again. This lets Filebeat avoid losing the contents of as many files as possible. However, it might lead to memory issues on the host by keeping too many files open.

If you run into these issues, you can give harvesters a limited lifetime by configuring close_timeout. For details, see: https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-log.html#filebeat-input-log-close-timeout
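A minimal sketch of how that could look in a log input (the path and the 5m value below are placeholders, adjust them to your setup):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log   # placeholder path
    # Stop each harvester after this duration, even if the output is blocked.
    # The read offset is kept in the registry, so the file is picked up again
    # on a later scan; in the meantime the open file handle is released.
    # Note: data can be lost if the file is rotated away before Filebeat
    # gets back to it.
    close_timeout: 5m
```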

Hi Noemi (@kvch),

Thanks a lot for your reply. Could you please explain up to what limit Filebeat will try to hold on to files it cannot forward? Is there a configuration parameter for that, or will it keep holding them until the host runs out of memory? Also, could you explain the behavior of the libbeat.pipeline.events.active metric? During the ES outage its value was fixed at 4117. Does it correspond to Filebeat's internal queue (https://www.elastic.co/guide/en/beats/filebeat/master/configuring-internal-queue.html)? Should we interpret this as 4117 being the maximum size of the queue, with all events beyond that dropped?
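For reference, this is the part of the linked doc I'm referring to, sketched with what I believe are the default values (not taken from our actual config):

```yaml
# Internal memory queue (values shown are, as far as I understand, the defaults)
queue.mem:
  events: 4096            # maximum number of events the queue can buffer
  flush.min_events: 2048  # publish a batch once this many events are queued
  flush.timeout: 1s       # or after this wait, whichever comes first
```

If events really defaults to 4096, would the 4117 we observed simply be the full queue plus a batch that had already been handed to the output?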

Thanks,
Arijit
