Potential memory leak issue with filebeat and metricbeat

kbujold_wr · May 25, 2023, 4:37pm

Since upgrading from ELK 7.17.1 to ELK 8.6.2 (and even with ELK 8.7.1) we are experiencing OOMKilled on filebeat and metricbeat pods. We had no issues with ELK 7.17.1. Increasing the resources allocations does not resolve the issue and simply delays the crash. This appears to be a memory leak issue with beats.

    State:          Running
      Started:      Thu, 25 May 2023 15:18:43 +0000
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 25 May 2023 02:53:22 +0000
      Finished:     Thu, 25 May 2023 15:18:41 +0000
    Ready:          True

This is an example of our filebeat pod memory in the past 24 hours

We have tried this config which was mentioned in other posts but it makes no differences. We also do not use cron jobs.

        processors:
          - add_kubernetes_metadata:
              add_resource_metadata:
                deployment: false
                cronjob: false

kbujold_wr · May 30, 2023, 1:36pm

We are also seeing this in ELK 8.8.0

cmacknz · May 30, 2023, 7:30pm

Can you post your complete Filebeat configuration?

What we need is heap profiles from Filebeat which should tell us what is using the memory. The instructions to do this are:

Start the Beat process with httpprof (profiling) enabled. This allows us to easily extract memory profiles of the running process. Add these configuration options:

http.host: localhost
http.port: 6060
http.pprof.enabled: true

Once Beats is started and is done initializing (after 5-10 minutes), you can collect the first memory dump via a simple curl command like this: curl -s -v http://localhost:8080/debug/pprof/heap > heap_normal.bin.
Once you start noticing that the process is taking excessive amounts of memory, a second dump needs to be generated like curl -s -v http://localhost:8080/debug/pprof/heap > heap_high.bin.

If you attach the .bin files we can analyze them to see what is going on. The profile taken when the memory usage is excessive is the most important one.

kbujold_wr · June 14, 2023, 7:55pm

I am unable to reply with the .bin attachments. How would you like me to share?

stephenb · June 14, 2023, 7:56pm

@kbujold_wr You can use pastebin

kbujold_wr · June 14, 2023, 8:28pm

pastebin is blocked at my workplace. Can you access here?
https://drive.google.com/drive/folders/1K-gtu_mcpLw6M3ocg8nDbtk0j8SkJBVs?usp=sharing

Below is our config map for filebeat. Note we have the same issue with metricbeat.

Name:         mon-filebeat-daemonset-config
Namespace:    monitor
Labels:       app=mon-filebeat
              app.kubernetes.io/managed-by=Helm
              chart=filebeat-8.5.1
              helm.toolkit.fluxcd.io/name=filebeat
              helm.toolkit.fluxcd.io/namespace=monitor
              heritage=Helm
              release=mon-filebeat
Annotations:  meta.helm.sh/release-name: mon-filebeat
              meta.helm.sh/release-namespace: monitor
              remoteconfigchecksum: 12341234abcdabcd

Data
====
filebeat.yml:
----
fields:
  system:
    name: yow-cgcs-supermicro-2
    uid: d1374af9-1e53-4339-92be-974f1b033347
fields_under_root: true
filebeat.autodiscover:
  providers:
  - hints.default_config:
      close_renamed: true
      paths:
      - /var/log/containers/*-${data.kubernetes.container.id}.log
      type: container
    hints.enabled: true
    host: ${NODE_NAME}
    type: kubernetes
filebeat.inputs:
- close_timeout: 5m
  enabled: true
  exclude_files:
  - ^/var/log/containers/
  - ^/var/log/pods/
  paths:
  - /var/log/*.log
  - /var/log/messages
  - /var/log/syslog
  - /var/log/**/*.log
  type: log
http.port: 5066
monitoring:
  cluster_uuid: ${CLUSTER_UUID}
  elasticsearch:
    hosts:
    - https://mon-elasticsearch-client:9200
    password: ${beats_system_monitoring_password}
    ssl.certificate_authorities:
    - /usr/share/filebeat/ext-ca.crt
    username: ${beats_system_monitoring_user}
  enabled: ${BEAT_MONITORING_ENABLED}
name: ${NODE_NAME}
output.elasticsearch:
  enabled: false
  host: ${NODE_NAME}
  hosts:
  - https://mon-elasticsearch-client:9200
  ilm.pattern: "000001"
  index: ${INDEX_NAME}-%{+yyyy.MM.dd}
  password: ${ELASTICSEARCH_PASSWORD}
  protocol: https
  ssl.certificate_authorities:
  - /usr/share/filebeat/ext-ca.crt
  username: ${ELASTICSEARCH_USERNAME}
output.file:
  enabled: false
output.logstash:
  enabled: true
  hosts:
  - mon-logstash:5044
  ssl.certificate: /usr/share/filebeat/config/instance/filebeat.crt
  ssl.certificate_authorities:
  - /usr/share/filebeat/ca.crt
  - /usr/share/filebeat/previous/ca.crt
  - /usr/share/filebeat/next/ca.crt
  ssl.key: /usr/share/filebeat/config/instance/filebeat.key
  timeout: 9
processors:
- add_kubernetes_metadata:
    annotations.dedot: true
    default_matchers.enabled: false
    labels.dedot: true
setup.dashboards:
  enabled: false
setup.kibana:
  host: mon-kibana:5601
  password: ${filebeat_password}
  protocol: https
  ssl.certificate_authorities:
  - /usr/share/filebeat/ext-ca.crt
  ssl.verification_mode: none
  username: ${filebeat_user}
setup.template:
  name: ${INDEX_NAME}
  pattern: ${INDEX_PATTERN}


BinaryData
====

Events:  <none>

stephenb · June 14, 2023, 11:46pm

@cmacknz ^^^

cmacknz · June 15, 2023, 9:06pm

Here's what I see in the profiles, looking specifically at the heap in use space profile

This is the normal case, nothing seems like it dominates it:

Here's the case where the heap is high, the heap is dominated by allocations from the Kubernetes watch API. This could mean a few things, one is that it could be receiving very large responses from the Kubernetes API, but it could also mean that we are subscribing to too many events and the rate of allocations is too high.

This isn't something I have an immediate answer for. It seems similar to other cases I have seen before but it looks like we are still working on a fix.

This is the first report, the entire discussion does have some comments with possible work arounds: [Metricbeat] Possible memory leak with autodiscover · Issue #33307 · elastic/beats · GitHub

This is the follow up issue, which is scoped specifically to CronJobs: `add_resource_metadata.cronjob` overloads the memory usage · Issue #31 · elastic/elastic-agent-autodiscover · GitHub

The interesting thing here is that CronJobs aren't involved at all. I'll ping the people working on those issues internally to see what they think.

kbujold_wr · June 27, 2023, 4:01pm

I have raised issue Potential memory leak issue with filebeat and metricbeat · Issue #35796 · elastic/beats · GitHub

The problem still has not been found.

system · July 25, 2023, 6:01pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticagent - filebeat is still increasing memory Beats	4	667	January 30, 2024
MetricBeat v7.3.1 memory leak Beats metricbeat	9	1277	December 10, 2019
Filebeat memory leak on Kubernetes? Beats filebeat	11	2498	August 6, 2018
Filebeat memory usage Beats filebeat	3	5878	January 18, 2019
Filebeat 8.X slow "memory leak" from kubernetes watcher Beats filebeat	0	192	June 7, 2024

Potential memory leak issue with filebeat and metricbeat

Related topics