After encountering significant performance issues and memory leaks with Filebeat when processing high volumes of logs, especially with the Fortinet module enabled, I made several configuration changes to address the bottleneck caused by the in-memory queue.
The root cause seemed to be that events were being processed too slowly, regardless of the worker count or other settings, leading to a backlog in the memory queue. To alleviate this, I switched from the in-memory queue (queue.mem) to the disk-based queue (queue.disk). This change alone wasn't a complete solution, but it did improve throughput.
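In filebeat.yml terms, the change boils down to configuring a queue.disk section, which replaces the default memory queue (as far as I know, only one queue type can be active at a time); my full settings are shown further down:
queue.disk:
  max_size: 15GB    # upper bound on the on-disk buffer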
The problem with the memory queue was that events were flushed too slowly, irrespective of the worker configuration or other settings. Changing output.elasticsearch.worker (5/10/20/40/100) had an unpredictable effect on performance, and increasing the number of workers did not significantly improve the results. This raised the question of whether worker performance was limited by the number of CPU cores or threads available.
Reducing queue.mem.flush.timeout to 10ms (I also tried 0, 1, and 100) gave better results but did not keep the queue stable: the queue size continued to grow, though more slowly with some configurations. Ultimately, switching to the disk-based queue resolved the problem.
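For reference, this is roughly what the memory-queue tuning I experimented with looked like (only the settings discussed above are shown; the exact values varied between tests):
queue.mem:
  flush.timeout: 10ms    # also tried 0, 1 and 100
output.elasticsearch:
  worker: 40             # also tried 5, 10, 20 and 100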
After all the tests, I ended up with this configuration:
output.elasticsearch:
  ...
  worker: 40
  bulk_max_size: 10000
  compression_level: 2
queue.disk:
  max_size: 15GB
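If further tuning is needed, the disk queue exposes a few more settings; this is only a sketch with illustrative values, not something I have benchmarked:
queue.disk:
  max_size: 15GB
  path: "${path.data}/diskqueue"   # default location, as far as I know; point it at faster storage if write IOPS become a problem
  segment_size: 1GB                # illustrative value; I believe the default is derived from max_size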
Filebeat stopped consuming excessive amounts of memory. It still uses some RAM, but it no longer piles events up in memory and gets overwhelmed during log surges or restarts; surplus events are now buffered on disk and handled more efficiently.
With these changes, I observed the following improvements:
- Filebeat's memory consumption stabilized; it still uses some memory, but it no longer behaves like a memory leak.
- The number of logs processed increased by approximately 40%.
- Filebeat can now handle a stable peak throughput of 12,000 events/second (I suspect that if IOPS were not a limiting factor, it could reach around 30k/s on this 2-CPU, 12 GB host).
- CPU usage increased by around 20% on the 2-core CPU (a stable 60% now).
- Using the filestream input instead of the log input also helped with stability (see the sketch after this list).
- Write IOPS skyrocketed (an issue I will investigate further).
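For the filestream point above, a minimal sketch of that input type (the id and path are placeholders, not my real config):
filebeat.inputs:
  - type: filestream
    id: my-log-input               # placeholder; filestream requires a unique id
    paths:
      - /var/log/example/*.log     # placeholder path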
My remaining question is whether the number of CPU cores and threads still matters with this disk-based queue configuration, given the unpredictable effect that adjusting the worker count had with the in-memory queue.




