Filebeat pods keep increasing memory usage

Hi all,

We are having quite a strange issue: running Filebeat in any version higher than 8.0.0 causes the Filebeat agent to keep increasing memory consumption until the pod is OOM killed. Versions 8.0.0 and below work fine.

The configuration of the Beat is as follows:

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: mynamespace-filebeat
  namespace: mynamespace
spec:
  configRef:
    secretName: mynamespace-filebeat-config
  daemonSet:
    podTemplate:
      metadata:
        creationTimestamp: null
      spec:
        automountServiceAccountToken: true
        containers:
        - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
          name: filebeat
          resources:
            limits:
              cpu: 1000m
              memory: 2000Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
          - mountPath: /var/log/containers
            name: varlogcontainers
          - mountPath: /var/log/pods
            name: varlogpods
          - mountPath: /var/lib/docker/containers
            name: varlibdockercontainers
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        securityContext:
          runAsUser: 0
        serviceAccount: mynamespace-elastic-beat-filebeat
        volumes:
        - hostPath:
            path: /var/log/containers
          name: varlogcontainers
        - hostPath:
            path: /var/log/pods
          name: varlogpods
        - hostPath:
            path: /var/lib/docker/containers
          name: varlibdockercontainers
    updateStrategy: {}
  elasticsearchRef:
    name: mynamespace-elastic
  kibanaRef:
    name: mynamespace-kibana
  monitoring:
    logs: {}
    metrics: {}
  type: filebeat
  # any version higher than 8.0.0 starts fine, but memory consumption keeps increasing until the pod is killed
  version: 8.0.0

Data handling configuration: we are filtering based on namespace/container names, and analysing, extracting and enriching fields.

apiVersion: v1
kind: Secret
metadata:
  name: mynamespace-filebeat-config
  namespace: mynamespace
stringData:
  beat.yml: |
    filebeat.autodiscover:
            providers:
              - type: kubernetes
                templates:
                  - condition:
                      and:
                        - contains.kubernetes.container.name: "containerName"
                        - or:
                            - contains.kubernetes.namespace: "namespace1"
                            - contains.kubernetes.namespace: "namespace2"
                            - contains.kubernetes.namespace: "namespace3"
                            - contains.kubernetes.namespace: "namespace4"
                    config:
                      - type: container
                        paths:
                          - /var/log/containers/*${data.kubernetes.container.id}*.log
                        processors:
                          - add_kubernetes_metadata:
                              host: ${NODE_NAME}
                              matchers:
                              - logs_path:
                                  logs_path: "/var/log/containers/"
                          - drop_event.when:
                              not:
                                contains:
                                  message: "Bearer" 
                          - dissect:
                              when:
                                and:
                                  - contains:
                                      kubernetes.container.name: "containerName"
                                  - contains:
                                      message: "Bearer"
                              tokenizer: '%{potential_space}request:"%{request}" response_code:%{response_code} authorization:"Bearer %{encoded_jwt_header}.%{encoded_jwt_payload}.%{encoded_jwt_signature}" authority:"%{authority}"'
                              field: "message"
                              target_prefix: ""
                          - copy_fields:
                              when:
                                and:
                                  - contains:
                                      kubernetes.container.name: "containerName"
                                  - has_fields: ['request']
                              fields:
                                - from: request
                                  to: endpoint
                              fail_on_error: true
                              ignore_missing: false
                          - script:
                              when:
                                and:
                                  - contains:
                                      kubernetes.container.name: "containerName"
                                  - has_fields: ['endpoint']
                              lang: javascript
                              id: strip_endpoint_value
                              source: >
                                function process(event) {
                                    // Extract endpoint without parameters
                                    event.Put('endpoint', event.Get('endpoint').replace(/^\S* ([^?]*).* .*/,'$1'))
                                }
                          - script:
                              when:
                                and:
                                  - contains:
                                      kubernetes.container.name: "containerName"
                                  - has_fields: ['encoded_jwt_payload']
                              lang: javascript
                              id: prepare_base64_decoding
                              source: >
                                function process(event) {
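                                    // Pad the payload with '=' so its length is a multiple of 4 before base64 decoding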
                                    event.Put('encoded_jwt_payload', event.Get('encoded_jwt_payload') + Array((4 - event.Get('encoded_jwt_payload').length% 4) % 4 + 1).join('='))
                                }
                          - decode_base64_field:
                              when:
                                and:
                                  - contains:
                                      kubernetes.container.name: "containerName"
                                  - has_fields: ['encoded_jwt_payload']
                              field:
                                from: "encoded_jwt_payload"
                                to: "decoded_jwt_payload"
                              ignore_missing: false
                              fail_on_error: true
                          - decode_json_fields:
                              when:
                                and:
                                  - contains:
                                      kubernetes.container.name: "containerName"
                                  - has_fields: ['decoded_jwt_payload']
                              fields: ["decoded_jwt_payload"]
                              process_array: false
                              max_depth: 1
                              target: ""
                              overwrite_keys: false
                              add_error_key: true
                          - include_fields:
                              when:
                                and:
                                  - contains:
                                      kubernetes.container.name: "containerName"
                                  - has_fields: ['decoded_jwt_payload']
                              fields: ["field1", "field2", "field3", "field4", "field5", "field6", "field7"]
                  - condition:
                      and:
                        - not.contains.kubernetes.container.name: "containerName"
                        - or:
                            - contains.kubernetes.namespace: "namespace1"
                            - contains.kubernetes.namespace: "namespace2"
                            - contains.kubernetes.namespace: "namespace3"
                            - contains.kubernetes.namespace: "namespace4"
                            - contains.kubernetes.namespace: "otherNamespace"
                    config:
                      - type: container
                        paths:
                          - /var/log/containers/*${data.kubernetes.container.id}*.log
                        add_kubernetes_metadata:
                          host: ${NODE_NAME}
                          matchers:
                          - logs_path:
                              logs_path: "/var/log/containers/"
                        processors:
                          - dissect:
                              tokenizer: '%{datetime} [%{thread}] %{loglevel->} %{logger} %{msg}'
                              field: "message"
                              target_prefix: ""
                        multiline:
                          pattern: '^([0-9]{4}-[0-9]{2}-[0-9]{2})'
                          negate: true
                          match: after
                  - condition:
                      and:
                        - not.contains.kubernetes.container.name: "containerName"
                        - or:
                            - contains.kubernetes.namespace: "namespace5"
                            - contains.kubernetes.namespace: "namespace6"
                            - contains.kubernetes.namespace: "namespace7"
                    config:
                      - type: container
                        paths:
                          - /var/log/containers/*${data.kubernetes.container.id}*.log
                        add_kubernetes_metadata:
                          host: ${NODE_NAME}
                          matchers:
                          - logs_path:
                              logs_path: "/var/log/containers/"
    setup.template.settings:
      index.number_of_shards: 20
      index.number_of_replicas: 1

Here is a snapshot of a pod running on 8.0.0:

Here is a snapshot of a pod running on 8.6.1:

Things we tried:

  • Include the queue.mem config with decreased values:
  queue.mem:
    events: 2048
    flush.min_events: 256
    flush.timeout: 5s
  • Decrease the number of shards from 20 to 3 (in setup.template.settings, as sketched below).
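
Concretely, that second point just means lowering the shard count in the setup.template.settings block shown above, roughly:

  setup.template.settings:
    index.number_of_shards: 3
    index.number_of_replicas: 1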

If anybody has any pointers on what we could check or change in the configuration to make this work on current versions, it would be highly appreciated.

Note: we ingest about 19-21 GiB of data per week and retain it for 90 days, in case it's relevant.
Thanks,
Andre

We have witnessed some OOM situations in specific clusters, for example ones with many cronjobs.

Can you try, under the add_kubernetes_metadata processor, adding:

add_resource_metadata:
  deployment: false
  cronjob: false

Relevant documentation here: Add Kubernetes metadata | Filebeat Reference [8.6] | Elastic
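
For reference, in your first template that would sit inside the existing add_kubernetes_metadata processor, roughly like this (a sketch based on the config you posted):

- add_kubernetes_metadata:
    host: ${NODE_NAME}
    matchers:
    - logs_path:
        logs_path: "/var/log/containers/"
    add_resource_metadata:
      deployment: false
      cronjob: false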

Hello @Andreas_Gkizas , thanks so much for the reply.

I updated Beats back to 8.6.1, applied the configuration you suggested in all the add_kubernetes_metadata: blocks, and tested it in our sandbox environment. Unfortunately, the trend is the same: once I hit the cluster with a load test, memory gets out of control until the pod is OOM killed.

I cannot really figure this one out: the setup on version 8.0.0 runs perfectly on 100 MiB, but any higher version gets the pods killed. If you have any other insight for me, I would highly appreciate it. Thank you!

Andre

Hello Andre,

I can clearly see that the multiple conditions and extra processing are not the problem here, as memory stays low on 8.0.0.

Can you please verify that the number of events you receive in 8.0 is the same (or almost the same) as in 8.6? I want to exclude the case that something was not working before :slight_smile:

Also, can you please take a pprof dump and upload it here so I can check where this memory is being consumed?
Here is how:

  1. Start the Beat process with the HTTP profiling endpoint (pprof) enabled. This allows us to easily extract memory profiles of the running process. Add these configuration options:
  http.host: localhost
  http.port: 6060
  http.pprof.enabled: true
  2. Once Beats has started and finished initializing (after 5-10 minutes), connect inside your Filebeat pod and collect the first memory dump with a simple curl command like this: curl -s -v http://localhost:6060/debug/pprof/heap > heap_normal.bin.
  3. Once you notice that the process is taking excessive amounts of memory, generate a second dump the same way: curl -s -v http://localhost:6060/debug/pprof/heap > heap_high.bin (see the sketch after this list for copying the dumps off the pod and comparing them).
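
One way to take a dump and copy it off the pod for comparison could look like this (a rough sketch; the pod name and namespace are placeholders, and the port matches the http.port set above):

# take a heap profile from inside the Filebeat pod and write it to /tmp
kubectl exec -n mynamespace <filebeat-pod> -- curl -s http://localhost:6060/debug/pprof/heap -o /tmp/heap_high.bin
# copy it to your workstation
kubectl cp mynamespace/<filebeat-pod>:/tmp/heap_high.bin ./heap_high.bin
# compare the "high" dump against the earlier "normal" one (needs a Go toolchain)
go tool pprof -top -base heap_normal.bin heap_high.bin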

Please also add some Filebeat logs from the same period.

In the meantime:

  • Do we have any frequently restarting pods or pods stuck in a pending state? Too many cronjobs (although I guess not, as the config suggested previously would have solved it)?
  • Other than that, maybe run kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace_you_monitor> to see if there is something interesting there?
  • Also try increasing your limits:
  resources:
    limits:
      memory: 500Mi
    requests:
      cpu: 100m
      memory: 200Mi

Does this help solve your problem, or is the issue that memory constantly increases and will eventually hit OOM no matter how big a limit we set?

Hi Andreas,

I have tried to send you the logs directly. To answer your remarks, here is a little clarification about the environment.

We are currently using a sandbox environment identical to our production to simulate the problem. As the production cluster is stable on version 8.0.0, we don't want to run any tests there.
That said, we are running a load test, hitting our API with a bunch of requests to check how Filebeat behaves on sandbox. While Filebeat doesn't totally crash when we run the load test, we can see that the instances running on the same node as one of our apps will eventually be OOM killed.

On some of the points you mentioned:

  • I can confirm that the same load test was used and the number of requests is exactly the same. The results are as follows:
    Filebeat Version 8.6.1 - OOM Killed
    Filebeat Version 8.3.1 - OOM Killed
    Filebeat Version 8.2.0 - OOM Killed
    Filebeat Version 8.1.0 - OOM Killed
    Filebeat Version 8.0.0 - works
    Filebeat Version 7.17.0 - works

Mind you, we only changed the Filebeat version; the Elasticsearch and Kibana versions are 8.6.1.

  • Memory limits/requests: we increased the memory values for Filebeat in our production while we were troubleshooting the issue there.
    The bottom line is that Filebeat in versions > 8.0.0 will use as much memory as it is given; please see the screenshots in my initial post. Version 8.0.0 running in production is fine with a 100 MiB request and a 300 MiB limit, and has not had a single restart.

  • There are no cronjobs running on the cluster and the deployments are pretty stable. No restarts or anything of the sort.

  • I couldn't see anything relevant in the kubectl events, just one or two pod restarts in the namespace where Elastic runs or the one where the application runs.

I hope this makes it a bit clearer what's going on. Let me know if you got the logs.

Rgds,
Andre

I am searching in our code to see what might explain this.

This PR could be the cause: Use NamespaceAwareResourceMetaGenerator for all generic kubernetes resources by tetianakravchenko · Pull Request #33763 · elastic/beats · GitHub. Can you please test with 7.17.9 as well?

Also, for my understanding: let's say you increase the memory to something big like 2Gi. Is Filebeat able to handle your load? Does the memory keep increasing?

I have now received your logs and will have a look over the next days.
Thank you as well

Hi @Andreas_Gkizas,
I finally got round to testing version 7.17.9 and I can confirm I am seeing the problem there as well: memory consumption goes up until the pod eventually restarts. If this change was introduced between version 8.0.x and 8.1.0, it could well be the reason this is happening.

As for your second question, the first thing we did when we noticed the issue was to increase the memory limit to 2 GiB. Unfortunately, after one or two hours we would see the issue again, and combined with the poor handling of the lock file in version 8.6.1 and lower, our system would not recover from restarts; if left unassisted we would lose log data.

If there is anything to test or try out in the meantime, please do let us know.

Rgds,
Andre

Hello Andre,

I have been checking your dumps, and for now I can see that the two biggest memory consumers are the registry reader (the GetFields function) and the kubernetes_metadata library.
(I am not going to add any screenshots here; they would only add more noise.)

On to our suggestions then:

  1. Try removing the add_kubernetes_metadata config. The metadata enrichment should already happen in the background, and we want to avoid any double calls to the library.
  2. If the first one does not work, try add_resource_metadata.namespace: false (I forgot to mention it earlier; see the sketch after this list). Let's see if this can work around your issue.
  3. I see a lot of lines in your logs like the following:
{"log.level":"info","@timestamp":"2023-02-13T15:14:06.940Z","log.logger":"input.harvester","log.origin":{"file.name":"log/harvester.go","file.line":337},"message":"Reader was closed. Closing.","service.name":"filebeat","input_id":"e02069c0-1226-442e-afb0-f3d6cd1bb401","source_file":"/var/log/containers/ad-login-ingress-nginx-controller-6bcfd98bc8-k9jqs_ad-login_controller-56bc20f3e0b1e6f9ced94e04ed3406d2b9b8e23dfe440b75101b1bcf52fb3eba.log","state_id":"native::258154-2049","finished":false,"os_id":"258154-2049","old_source":"/var/log/containers/ad-login-ingress-nginx-controller-6bcfd98bc8-k9jqs_ad-login_controller-56bc20f3e0b1e6f9ced94e04ed3406d2b9b8e23dfe440b75101b1bcf52fb3eba.log","old_finished":true,"old_os_id":"258154-2049","harvester_id":"00bbbb86-a60b-4c81-937d-f40e60fb2a46","ecs.version":"1.6.0"}

Can you please describe me how you conduct the tests? Is it like you kill the pods and recreating them?

  4. I would also try to narrow the problem down, to see which condition in your configuration might be problematic. Can you keep just one condition at a time and repeat the tests? That way we can narrow it down and focus on the problematic case.
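
For placement, suggestion 2 would go next to the existing add_kubernetes_metadata options, with the key written exactly as above (just a sketch, not a tested snippet):

- add_kubernetes_metadata:
    host: ${NODE_NAME}
    matchers:
    - logs_path:
        logs_path: "/var/log/containers/"
    add_resource_metadata.namespace: false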

We are investigating with the team and will get back to you.

Any update on this issue? I am confronted with something similar on Filebeat 8.6.1 as well.

Hi @germain_nganko ,

I was away on holiday and couldn't test the last suggestion from Andreas. I will do it today or tomorrow and report back.

Thanks
Andre

Hi @Andreas_Gkizas,
Quick update from my side: I tested the first recommendation (removing the add_kubernetes_metadata config).

It has worked fine in my test environment, and it's been working fine in my production environment for a couple of hours now. I will keep an eye on it for the next two days, but it does look promising.

Thank you for all your support. I will give you final feedback in a couple of days!
Andre

Hi all,

I want to confirm the fix: removing the add_kubernetes_metadata config in the three spots where we had it fixed our problem. We have been running Filebeat version 8.6.0 without problems for five days now.
@Andreas_Gkizas, can you quickly explain again what the background of this config is? If I am not mistaken, we took it from sample configurations in the Elastic documentation.
Is it something that now happens by default after the PR you linked, so that leaving the config in place just did the enrichment twice or something? If that's indeed the case, it looks like Filebeat should have been able to handle this better.

Anyways, I want to thank you for your prompt help with the issue!!! It was awesome!

Andre

Hello,

We are currently experiencing a similar issue with our Filebeat v7.16.2 pods. Although we use the add_kubernetes_metadata configuration, we cannot remove it as we rely on the information it provides for our project.

Could you please suggest an alternative solution to this problem?

Thank you,
Vlad
