Filebeat 8.12.2 (with Kubernetes hints-based autodiscover and Kafka output) apparently leaks memory

Hi.

We're running the following configuration of Filebeat:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: REDACTED-filebeat-filebeat
spec:
  ... REDACTED
  template:
    metadata: REDACTED
    spec:
      containers:
      - args:
        - -e
        - -E
        - http.enabled=true
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: docker.elastic.co/beats/filebeat:8.12.2
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          privileged: false
          runAsUser: 0
        volumeMounts:
        - mountPath: /usr/share/filebeat/filebeat.yml
          name: filebeat-config
          readOnly: true
          subPath: filebeat.yml
        - mountPath: /usr/share/filebeat/data
          name: data
        - mountPath: /var/lib/docker/containers
          name: varlibdockercontainers
          readOnly: true
        - mountPath: /var/log
          name: varlog
          readOnly: true
        - mountPath: /var/run/docker.sock
          name: varrundockersock
          readOnly: true
      volumes:
      - configMap:
          name: REDACTED-filebeat-filebeat-daemonset-config
        name: filebeat-config
      - hostPath:
          path: /var/lib/REDACTED-filebeat-filebeat-logging-data
          type: DirectoryOrCreate
        name: data
      - hostPath:
          path: /var/lib/docker/containers
          type: ""
        name: varlibdockercontainers
      - hostPath:
          path: /var/log
          type: ""
        name: varlog
      - hostPath:
          path: /var/run/docker.sock
          type: ""
        name: varrundockersock
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 50%
    type: RollingUpdate

With the following config:

filebeat.autodiscover:
  providers:
  - type: kubernetes
    hints:
      enabled: true
      default_config:
        type: filestream
        id: kubernetes-container-logs-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
        take_over: true
        enabled: false
        paths:
        - /var/log/containers/*-${data.kubernetes.container.id}.log  # CRI path
        parsers:
        - container: 
            stream: all
            format: auto
        prospector:
          scanner:
            symlinks: true
    add_resource_metadata: # Considers namespace annotations for hints
      deployment: false
      cronjob: false
      namespace:
        include_annotations:
          - "nsannotations1"
processors:
- rename:
    ignore_missing: true
    fail_on_error: false
    fields:
    - from: kubernetes.labels.REDACTED
      to: REDACTED
    - from: kubernetes.labels.REDACTED
      to: REDACTED
    - from: kubernetes.labels.REDACTED
      to: REDACTED
    - from: kubernetes.namespace_labels.REDACTED
      to: REDACTED
    - from: kubernetes.namespace_labels.REDACTED
      to: REDACTED
    - from: kubernetes.namespace_labels.REDACTED
      to: REDACTED
- copy_fields:
    fields:
    - from: REDACTED
      to: REDACTED
    when:
      not:
        has_fields:
        - REDACTED
output:
  kafka:
    version: 2
    hosts:
      - REDACTED:9093
      - REDACTED:9093
    topic: 'REDACTED-%{[REDACTED]:REDACTED}-%{[REDACTED]:REDACTED}'
    ssl:
      certificate_authorities:
      - |
        REDACTED

On that K8s cluster, there is only one test pod using Filebeat for log collection, and it logs 2 messages
every 30 seconds.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test-filebeat
  name: test-filebeat
spec:
  selector:
    matchLabels:
      app: test-filebeat
  template:
    metadata:
      annotations:
        co.elastic.logs/enabled: "true"
      labels:
        app: test-filebeat
        REDACTED
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - "while true\ndo\n  echo \"Bip bop I'm a Pod, my name is ${NAME} and I'm
          testing filebeat (with kafka output) log collection\" \n  date\n  sleep
          ${NAP_SECONDS}\ndone\n"
        env:
        - name: NAP_SECONDS
          value: "30"
        - name: NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: alpine

Filebeat appears to steadily consume more memory, which leads me to suspect a leak somewhere.

Here is the memory graph (the exact metric is container_memory_working_set_bytes) for the Filebeat container that collects those logs (reminder: 2 events every 30 seconds). It is the middle line in the graph; the other lines are the other Filebeat containers, which are not collecting anything.

As you can see, at first it stays in the 200Mi-400Mi range (which is already fairly high), but then it grows to a floor of approximately 300Mi, with peaks as high as 700Mi.

The last 24h were eventful: our control plane had some restarts due to the underlying infrastructure, which caused Filebeat to consume even more memory and get OOMKilled twice:

(Being a new user, I can't post two graphs at once; I'll add the second one in a comment.)

While this is an unusual situation (the control plane being in bad shape), I don't think it should cause Filebeat to blow up as well, especially since that means we could lose control-plane logs exactly when we need them most (we only use Filebeat for application logs currently, but intend to ship Kubernetes logs soon as well).

I've also noted a minor inconsistency: looking at the Filebeat logs, which periodically report some metrics, I see memstats.memory_alloc values that are substantially lower than container_memory_working_set_bytes in my monitoring. I assume they are not measuring exactly the same thing?
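For what it's worth, here is how I compare the two figures (just a sketch: port 5066 and the /stats endpoint come from http.enabled=true in the DaemonSet args, the jq path mirrors the memstats fields visible in the log lines further down, and I'm assuming curl is available in the image):

# Go runtime view, as reported by Filebeat itself (memory_alloc = live heap, rss = process RSS)
kubectl exec -n REDACTED <one-of-the-filebeat-pods> -- curl -s http://localhost:5066/stats | jq '.beat.memstats'

# cgroup view used by my monitoring (working set is roughly usage minus inactive file cache)
kubectl top pod -n REDACTED <one-of-the-filebeat-pods> --containers

So I'd expect the two numbers to differ, since one is the live Go heap and the other is what the kernel accounts to the cgroup.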

If needed, I can enable pprof and run it for a few days to collect more data.
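Concretely, here is what I have in mind (a sketch: http.pprof.enabled is, as far as I can tell, the documented Beats setting for this, and 5066 is the default port of the HTTP endpoint already enabled above):

# extra container args in the DaemonSet, next to the existing -E http.enabled=true
- -E
- http.pprof.enabled=true

# then, from my workstation, grab a heap profile from one of the pods
kubectl exec -n REDACTED <one-of-the-filebeat-pods> -- curl -s http://localhost:5066/debug/pprof/heap > heap.pprof
go tool pprof -top heap.pprof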

If that's not a leak, any ideas on how to prevent Filebeat from growing its memory usage like this?

Thanks for reading.

Logs: I don't have logs for the Filebeat instance that exhibited the first behavior, as there were two restarts due to OOM. However, I didn't notice anything out of the ordinary while looking at them, mostly "Metrics" lines. The last ones on the currently running pod are:

{"log.level":"info","@timestamp":"2024-03-12T11:30:28.162Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":41,"throttled":{"ns":615030897,"periods":1}}},"cpuacct":{"total":{"ns":628925660}},"memory":{"mem":{"usage":{"bytes":331001856}}}},"cpu":{"system":{"ticks":61130,"time":{"ms":10}},"total":{"ticks":953000,"time":{"ms":330},"value":953000},"user":{"ticks":891870,"time":{"ms":320}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44670057},"version":"8.12.2"},"memstats":{"gc_next":257757528,"memory_alloc":150620368,"memory_total":331671957272,"rss":397660160},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":0,"added":2,"done":4},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":4,"active":0,"batches":2,"total":4}},"outputs":{"kafka":{"bytes_read":312,"bytes_write":3822}},"pipeline":{"clients":1,"events":{"active":0,"published":2,"total":2},"queue":{"acked":4}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.02,"15":0.16,"5":0.16,"norm":{"1":0.0025,"15":0.02,"5":0.02}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:30:58.162Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":34}},"cpuacct":{"total":{"ns":676148123}},"memory":{"mem":{"usage":{"bytes":344637440}}}},"cpu":{"system":{"ticks":61150,"time":{"ms":20}},"total":{"ticks":953390,"time":{"ms":390},"value":953390},"user":{"ticks":892240,"time":{"ms":370}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44700056},"version":"8.12.2"},"memstats":{"gc_next":263442976,"memory_alloc":188756712,"memory_total":331816014528,"rss":410624000},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":2,"added":2},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":2,"published":2,"total":2}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.01,"15":0.15,"5":0.14,"norm":{"1":0.0013,"15":0.0188,"5":0.0175}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:31:28.161Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":32,"throttled":{"ns":436049862,"periods":1}}},"cpuacct":{"total":{"ns":610060642}},"memory":{"mem":{"usage":{"bytes":314544128}}}},"cpu":{"system":{"ticks":61180,"time":{"ms":30}},"total":{"ticks":953720,"time":{"ms":330},"value":953720},"user":{"ticks":892540,"time":{"ms":300}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44730057},"version":"8.12.2"},"memstats":{"gc_next":232351944,"memory_alloc":192536696,"memory_total":331941757520,"rss":380456960},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":0,"added":2,"done":4},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":4,"active":0,"batches":2,"total":4}},"outputs":{"kafka":{"bytes_read":312,"bytes_write":3813}},"pipeline":{"clients":1,"events":{"active":0,"published":2,"total":2},"queue":{"acked":4}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0,"15":0.14,"5":0.13,"norm":{"1":0,"15":0.0175,"5":0.0163}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:31:58.161Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":40,"throttled":{"ns":721637158,"periods":2}}},"cpuacct":{"total":{"ns":761081044}},"memory":{"mem":{"usage":{"bytes":324538368}}}},"cpu":{"system":{"ticks":61190,"time":{"ms":10}},"total":{"ticks":954180,"time":{"ms":460},"value":954180},"user":{"ticks":892990,"time":{"ms":450}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44760057},"version":"8.12.2"},"memstats":{"gc_next":254664584,"memory_alloc":131239520,"memory_total":332087599968,"rss":391262208},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":0,"added":2,"done":2},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"acked":2,"active":0,"batches":1,"total":2}},"outputs":{"kafka":{"bytes_read":156,"bytes_write":1912}},"pipeline":{"clients":1,"events":{"active":0,"published":2,"total":2},"queue":{"acked":2}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.16,"15":0.15,"5":0.15,"norm":{"1":0.02,"15":0.0188,"5":0.0188}}}},"ecs.version":"1.6.0"}}
{"log.level":"info","@timestamp":"2024-03-12T11:32:28.161Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpu":{"stats":{"periods":32,"throttled":{"ns":531048983,"periods":1}}},"cpuacct":{"total":{"ns":685457701}},"memory":{"mem":{"usage":{"bytes":338583552}}}},"cpu":{"system":{"ticks":61220,"time":{"ms":30}},"total":{"ticks":954590,"time":{"ms":410},"value":954590},"user":{"ticks":893370,"time":{"ms":380}}},"handles":{"limit":{"hard":65535,"soft":65535},"open":17},"info":{"ephemeral_id":"cbbf9d34-2ad6-4ec7-9d55-8c311d3e0763","uptime":{"ms":44790058},"version":"8.12.2"},"memstats":{"gc_next":275733944,"memory_alloc":169929744,"memory_total":332231794168,"rss":405037056},"runtime":{"goroutines":85}},"filebeat":{"events":{"active":2,"added":2},"harvester":{"open_files":1,"running":1}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0}},"pipeline":{"clients":1,"events":{"active":2,"published":2,"total":2}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.1,"15":0.14,"5":0.13,"norm":{"1":0.0125,"15":0.0175,"5":0.0163}}}},"ecs.version":"1.6.0"}}

I do have some error logs saying the filestream ID could lead to duplicates, but I don't see duplicates in the output.

{"log.level":"error","@timestamp":"2024-03-12T10:40:44.314Z","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.(*InputManager).Create","file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'kubernetes-container-logs-test-filebeat-55795d7b56-f6qvk-445a87fc969aa48e3bbc6dd5e60c318f2265f768c7f2ef8ef9c067df81685228' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}

And finally I got a trace (not sure where this one came from): 2024-03-12T06:44:22.493204589+01:00 Trace[445770726]: ---"Objects listed" error:<nil> 45067ms (05:44:22.492)

The second memory graph (with the two Filebeat OOM kills):

Possibly relevant: Filebeat pods keeps increasing Memory usage - #12 by oandre7

Some news: this behavior is apparently cluster-specific. I deployed the exact same config on a bigger cluster with more activity, and memory usage stays around 150 MiB even with normal activity (it's a pre-prod environment, so not a lot, but still more than on my first cluster).

Here is a heap profile from one of the Filebeat pods demonstrating the high memory usage:

podEvents and containerPodEvents seem to account for a lot of it :thinking:

Interestingly, the high memory usage persists when I remove the co.elastic.logs annotations from the pod, disabling log collection.
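Since the heap profile points at the autodiscover pod/container event buffers rather than at the harvesters, one thing I intend to try is pinning the provider's watch to the node explicitly (a sketch based on my reading of the kubernetes provider options; node and scope are the settings I believe are relevant here, and NODE_NAME is already injected in the DaemonSet env above):

filebeat.autodiscover:
  providers:
  - type: kubernetes
    node: ${NODE_NAME}   # make sure the watcher only tracks pods on this node
    scope: node
    hints:
      enabled: true
      # ... rest of the config unchanged

I don't know yet whether this will change anything.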
