Hello,
Over the last few days I've been trying to investigate an issue I was having in our logging system running in a production environment.
The system is a pretty straightforward Filebeat -> Logstash -> Elasticsearch pipeline.
Our numbers are roughly these:
- 40+ nodes (Filebeat running as DaemonSet on each node)
- 4-6 Logstash instances based on the load
- 5-10M logs processed per hour (Filebeat is set up to filter some, so there are actually more log lines in the .log files)
What we noticed was that some nodes were not able to keep up with log ingestion and suffered from constant CPU throttling, high memory usage, and the resulting OOM terminations and general slowness. This behaviour was observed with CPU requests set to 2.
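To make that number concrete, the relevant part of the Filebeat container spec looked roughly like the sketch below; only the CPU request of 2 is the real value from our setup, while the memory figures and the limits are illustrative placeholders.

# Illustrative Filebeat container resources; only the CPU request of 2
# reflects our actual setup, the rest are placeholder values.
resources:
  requests:
    cpu: "2"
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 1Gi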
Now to the interesting part:
The nodes that were experiencing these performance issues were part of a node group where we specifically run Kubernetes Jobs, which means these nodes see a high turnover of Pods and containers. This was the only difference between them and all the other nodes, which were behaving fairly well.
Our Filebeat configuration looked like this:
logging.json: true
logging.metrics.enabled: false
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      templates:
        - condition.and:
            - not.contains:
                kubernetes.namespace: "namespace1"
            - not.contains:
                kubernetes.namespace: "namespace2"
            - not.contains:
                kubernetes.namespace: "namespace3"
            - not.contains:
                kubernetes.namespace: "namespace4"
            - not.contains:
                kubernetes.namespace: "namespace5"
            - not.contains:
                kubernetes.namespace: "namespace6"
            - not.contains:
                kubernetes.container.name: "container1"
          config:
            - type: container
              id: container-logs
              paths:
                - "/var/log/containers/*-${data.kubernetes.container.id}.log"
output.logstash:
  hosts: ["logstash-logging:5044"]
  ttl: 60s
  pipelining: 0
Which should be a fairly standard configuration.
I ran some experiments to try to reduce the number of variables that could be causing the issue, and reached the following conclusions:
- Logstash was NOT the bottleneck, since the performance issues were still present after setting the output to discard (see the sketch right after this list)
- The filters were NOT causing any significant slowdown, since removing them did not improve performance at all
- Migrating from the container input type to filestream, as suggested by the documentation, did not improve the situation either, even when trying every possible combination of clean_*, ignore_older, etc.
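For reference, the "discard" test above was done by swapping the Logstash output for the discard output, which simply throws every event away; a minimal sketch, assuming a Filebeat version that ships this output:

# Temporary replacement for output.logstash while benchmarking;
# every event is dropped, so Logstash is taken out of the equation.
output.discard:
  enabled: true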
What I noticed after enabling metrics and trying to correlate performance with other variables was that relatively 'old' nodes (12h+) would suffer from the high CPU usage no matter what; not even restarting the Filebeat Pods or deleting the metadata files made a difference.
New nodes would start with reasonable resource usage, which would then grow over time until it saturated the resources assigned to the Pod.
The only suspicious values in the metrics were a high number of active modules and goroutines.
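The goroutine count I'm referring to comes from Filebeat's internal metrics. Besides the periodic lines you get in the Filebeat log with logging.metrics.enabled: true, the same counters can be exposed on a local HTTP endpoint and polled with curl against /stats; this endpoint is just how I'd watch it, not something that was part of our original setup:

# Optional local monitoring endpoint; /stats returns the beat's internal
# counters, which is where I'd expect the goroutine count to show up too.
http.enabled: true
http.host: localhost
http.port: 5066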
My hypothesis is that the autodiscovery mechanism itself spawns A LOT of sub-processors, one for each watched resource. These tend to accumulate over time, adding overhead that is usually not noticeable, unless you have a high turnover rate, as in our case.
So I decided to try to get rid of autodiscovery altogether, since in our Kubernetes scenario we are fine with processing all /var/log/containers/*.log files on each node, with no fancy tracking logic. Of course we still want the Kubernetes metadata added to each log event.
What I ended up writing was this configuration file:
logging.json: true
logging.metrics.enabled: false
filebeat.inputs:
  - type: filestream
    id: static-containers-input
    prospector.scanner.symlinks: true
    take_over: true
    parsers:
      - container:
          stream: all
          format: cri
    paths:
      - "/var/log/containers/*.log"
processors:
  - add_kubernetes_metadata:
      in_cluster: true
      indexers:
        - container:
      matchers:
        - logs_path:
            logs_path: '/var/log/containers/'
            resource_type: 'container'
  - drop_event:
      when:
        or:
          - equals:
              kubernetes.namespace: "namespace1"
          - equals:
              kubernetes.namespace: "namespace2"
          - equals:
              kubernetes.namespace: "namespace3"
          - equals:
              kubernetes.namespace: "namespace4"
          - equals:
              kubernetes.namespace: "namespace5"
          - equals:
              kubernetes.namespace: "namespace6"
          - equals:
              kubernetes.container.name: "container1"
output.logstash:
  hosts: ["logstash-logging:5044"]
  ttl: 60s
  pipelining: 0
From my understanding, this configuration runs only a single instance of the filestream input, and just adds metadata and performs filtering on top of it.
After switching to this configuration, resource consumption on our Filebeat Pods dropped significantly, and not only on our 'job' nodes. The resulting logs seem to match the old ones in terms of volume and metadata, so I assume the ingested load is the same.
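To back the "numbers match" impression with something more concrete, one simple option is to re-enable the periodic metrics under both configurations and compare the acked event counters over the same time window; a minimal sketch (the 60s period is just an illustrative value):

# The 'Non-zero metrics' log lines include libbeat.output.events.acked,
# i.e. events confirmed by Logstash, which can be compared between the
# old and the new configuration over the same window.
logging.metrics.enabled: true
logging.metrics.period: 60s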
Now let's ask the important questions:
- As far as you can tell, is my new configuration correct, and should it return the same logs I would expect from the autodiscovery mechanics?
- Are there any known issues with the autodiscovery mechanics when a lot of containers are spawned over the course of a node's lifetime?
- Shouldn't the 'single input' configuration be the standard for a Kubernetes Filebeat deployment when using DaemonSets? Resource usage seems to be significantly lower with it, and I cannot see any downside to this approach. What value does the autodiscovery mechanism add?