Filebeat 7.10 fails to collect events from multiple Kubernetes pods

Filebeat is configured to collect events from multiple Kubernetes pods using an or condition. Events from one specific pod are collected continuously, but events from another pod are collected very slowly, and after some time no events are collected from it at all.

Commenting out all the other pods and leaving a single one in the configuration works well, and the events reach Elasticsearch quickly.

There are 3 worker nodes, on which Filebeat (v7.10.2) runs as a DaemonSet. Each Filebeat pod has a CPU limit of 4 cores and a memory limit of 4 GB. One index is generated per day, and its size does not exceed 2 GB.

I want Filebeat to collect events from all the pods and get them into Elasticsearch with minimal delay. Please help me understand the issue and the best practices for improving Filebeat performance.

filebeat.yml -

  filebeat.autodiscover:
    providers:
      - type: kubernetes
        node: ${NODE_NAME}
        tags:
          - "kube-logs"
        templates:
          - condition.or:
              - contains:
                  kubernetes.pod.name: "ne-db-manager"
              - contains:
                  kubernetes.pod.name: "ne-mgmt"
              - contains:
                  kubernetes.pod.name: "list-manager"
              - contains:
                  kubernetes.pod.name: "scheduler-mgmt"
              - contains:
                  kubernetes.pod.name: "sync-ne"
              - contains:
                  kubernetes.pod.name: "file-manager"
              - contains:
                  kubernetes.pod.name: "dash-board"
              - contains:
                  kubernetes.pod.name: "config-manager"
              - contains:
                  kubernetes.pod.name: "report-manager"
              - contains:
                  kubernetes.pod.name: "clean-backup"
              - contains:
                  kubernetes.pod.name: "warrior"
              - contains:
                  kubernetes.pod.name: "ne-backup"
              - contains:
                  kubernetes.pod.name: "ne-restore"
            config:
              - type: container
                paths:
                  - "/var/log/containers/*-${data.kubernetes.container.id}.log"
                multiline.type: pattern
                multiline.pattern: '^[[:space:]]'
                multiline.negate: false
                multiline.match: after
  logging.level: debug
  processors:
    - drop_event:
        when.or:
           - equals:
               kubernetes.namespace: "kube-system"
           - equals:
               kubernetes.namespace: "default"
           - equals:
               kubernetes.namespace: "logging"
  output.logstash:
    hosts: ["logstash-service.logging:5044"]
    index: filebeat
    pretty: true
  setup.template.name: "filebeat"
  setup.template.pattern: "filebeat-*"

I'm not very familiar with Kubernetes or how its discovery is implemented in Filebeat, but here is one thing you may want to try. Instead of using a condition.or with a bunch of contains clauses, have you tried giving all the pods a single label (e.g. logging: filebeat) and then selecting everything with that label? A potential issue you might be running into is that all your contains clauses could be slowing down discovery. Using a single label as the selector for all of the pods might help, both because there is only one match to evaluate and because it is an exact match rather than a contains match.
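
A minimal sketch of what that could look like, assuming every pod you want to collect from carries the label logging: filebeat (the label name and value are only an example, pick your own). The config part is copied from your existing template, and I have not tested this against your cluster:

  filebeat.autodiscover:
    providers:
      - type: kubernetes
        node: ${NODE_NAME}
        templates:
          # One exact match on a label instead of many contains clauses.
          # "logging: filebeat" is a hypothetical label for illustration.
          - condition:
              equals:
                kubernetes.labels.logging: "filebeat"
            config:
              - type: container
                paths:
                  - "/var/log/containers/*-${data.kubernetes.container.id}.log"

The label has to be set on the pods themselves (for example in the pod template of each Deployment or Job), since the condition is evaluated against the pod metadata.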

I have added a single label to all the pods except the last two (ne-backup and ne-restore) and deployed.
But I see that Filebeat is unable to collect events from the other pods while it is collecting events from the 'ne-backup' pods.
When I comment out 'ne-backup' in the Filebeat configuration, it collects the expected events from the rest of the pods.
There are actually 600+ 'ne-backup' pods, which are triggered every 5 to 10 minutes.
I guess this huge volume of logs is why Filebeat is unable to collect events from the other pods.
This scenario works in Filebeat v6.5, so I wonder why it does not work properly in v7.10.

filebeat.yml -

  filebeat.autodiscover:
    providers:
      - type: kubernetes
        node: ${NODE_NAME}
        tags:
          - "kube-logs"
        templates:
          - condition.or:
              - contains:
                  kubernetes.labels.fbapp: "filebeat"
              - contains:
                  kubernetes.pod.name: "ne-backup"
              - contains:
                  kubernetes.pod.name: "ne-restore"
            config:
              - type: container
                paths:
                  - "/var/log/containers/*-${data.kubernetes.container.id}.log"
                multiline.type: pattern
                multiline.pattern: '^[[:space:]]'
                multiline.negate: false
                multiline.match: after
  logging.level: debug
  processors:
    - drop_event:
        when.or:
           - equals:
               kubernetes.namespace: "kube-system"
           - equals:
               kubernetes.namespace: "default"
           - equals:
               kubernetes.namespace: "logging"
  output.logstash:
    hosts: ["logstash-service.logging:5044"]
    index: filebeat
    pretty: true
  setup.template.name: "filebeat"
  setup.template.pattern: "filebeat-*"

Please share any debugging steps or suggestions.

I would recommend enabling metrics collection on these Filebeat nodes. It should provide some useful insight into where you're running into a limitation.

On a somewhat related note, you mention having only 3 worker nodes, but then you say you have at least 600 pods for ne-backup. That would put you at a minimum of 200 pods per node, which is double the recommended 100 pods per node that Kubernetes is designed for. You may be running into some sort of Kubernetes constraint.

Actually, I am using OKD 3.11 to deploy my applications, and OpenShift supports 200 pods per node.
Grafana is also already present in my cluster, and I do not see any resource shortages.

[Screenshot: CPU and memory limits]

NOTE: I see that Filebeat is harvesting the 'ne-db-manager' logs but is unable to collect events from them. It is able to collect events from the 'ne-backup' pods, which create a large number of log files.

Apologies, I wasn't clear in my statement about monitoring Filebeat. I meant collecting the metrics that Filebeat itself exposes via the metrics option. These include far more information about events processed, events queued, and so on. You might not be hitting a CPU or memory limit, but you could be hitting some other limit within your environment.
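
As a rough sketch of how to turn this on in filebeat.yml, assuming you want both the local HTTP stats endpoint and the periodic metrics lines in the Filebeat log (the 30s period is just an example value):

  # Expose Filebeat's internal stats on a local HTTP endpoint.
  http.enabled: true
  http.host: localhost
  http.port: 5066

  # Also write a periodic snapshot of the internal metrics to the Filebeat log.
  logging.metrics.enabled: true
  logging.metrics.period: 30s

With the endpoint enabled, running curl 'localhost:5066/stats?pretty' inside the Filebeat pod (assuming curl is available in the image) should return the counters, including harvester, pipeline, and output statistics.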

Please find the Filebeat metrics here: filebeat_metrics.log - 0154ccbd
A portion of the log is here: filebeat.log - 33312269
Let me know if you need any more details.

After looking over the metrics you provided, I'm not seeing anything too offending. I'm not sure I can be of much help from here, as I'm at the limit of my current knowledge of how Kubernetes and Filebeat work. The only thing I could recommend is to keep gathering the metrics for the Filebeat agents, either via Elasticsearch monitoring or Prometheus monitoring, and hope that either it provides some useful information or someone else comes across this topic and is able to provide more help.
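
For the Elasticsearch monitoring route, something along these lines in filebeat.yml should ship the internal metrics to a monitoring cluster. This is only a sketch: the host below is a placeholder, not taken from your setup, and because your output is Logstash rather than Elasticsearch, the monitoring hosts have to be set explicitly:

  # Send Filebeat's own metrics to Elasticsearch for Stack Monitoring.
  monitoring.enabled: true
  monitoring.elasticsearch:
    # Placeholder address; replace with your Elasticsearch endpoint.
    hosts: ["http://elasticsearch.logging:9200"]

If security is enabled on that cluster, credentials would also need to be added under monitoring.elasticsearch.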
