Filebeat 7.12 is collecting events very slowly

I have deployed Filebeat 7.12 in my cluster to collect events from Kubernetes logs using autodiscover.

The events are collected in NRT (near real time) for all the pods I want, but after a period of time (1 to 2 hours) collection stops for just 2 to 3 pods.

There are around 260 pods, of which 240 are recreated every 10 to 15 minutes. Filebeat harvests the logs and collects and ships the events successfully for those 240 pods, and also for most of the other pods, except for 2 to 3 pods.

The behaviour is the same whether Filebeat sends the events to Logstash or directly to the console. The missing events for those 2 to 3 pods are only collected at the end, once no more of the 240 pods are being generated.

I updated the Filebeat configuration to not collect events from those 240 pods. This time the events were collected for all the pods in NRT.

I tried tweaking many parameters, such as max_procs, close_inactive, ignore_older, output.logstash.workers, output.logstash.bulk_max_size, queue.mem.flush.min_events and queue.mem.flush.timeout, but none of them resolved the issue.
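For reference, the knobs above sit in filebeat.yml roughly as follows. This is only a sketch with illustrative values (taken from the commented-out lines in my config), not a recommendation:

```yaml
# Illustrative values only -- these are the parameters I experimented with.
max_procs: 4                # limit the number of OS threads Filebeat uses

filebeat.inputs:
  - type: container
    close_inactive: 5m      # close a harvester after 5m with no new data
    ignore_older: 10m       # skip files not updated in the last 10m

output.logstash:
  hosts: ["logstash-headless.logging:5044"]
  workers: 16               # parallel connections to Logstash
  bulk_max_size: 1600       # maximum events per batch

queue.mem:
  events: 51200             # internal queue capacity
  flush.min_events: 1600    # flush once this many events are buffered...
  flush.timeout: 1s         # ...or after this long, whichever comes first
```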

Resources Allocated -
RAM - 2 to 4 GB and
CPU - 2 to 4 cores

There are 4 filebeat pods running on each worker node.
CPU metrics - [screenshot]

Memory metrics - [screenshot]
Here is the Filebeat configuration I am using:

```yaml
      - type: kubernetes
        node: ${NODE_NAME}
          - "kube-logs"
          - condition.or:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
         "ne-ops"  # This name will be found in 240 pods
              - type: container
                  - "/var/log/containers/*-${}.log"
                multiline.type: pattern
                multiline.pattern: '^[[:space:]]'
                multiline.negate: false
                multiline.match: after
                #scan_frequency: 1s
                #close_inactive: 5m
                #ignore_older: 10m
  max_procs: 4
  filebeat.shutdown_timeout: 5s
  logging.level: debug
    - drop_event:
           - equals:
               kubernetes.namespace: "kube-system"
           - equals:
               kubernetes.namespace: "default"
           - equals:
               kubernetes.namespace: "logging"
    - fingerprint:
        fields: ["message"]
        target_field: "@metadata._id"
    hosts: ["logstash-headless.logging:5044"]
    #, "logstash-headless.logging:5045"]
    #loadbalance: true
    #workers: 16
    index: filebeat
    pretty: false
    #bulk_max_size: 1600
    #compression_level: 9
  #  events: 51200
  #  flush.min_events: 1600
  #  flush.timeout: 1s
  setup.template.name: "filebeat"
  setup.template.pattern: "filebeat-*"
```
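Since some indentation and keys were lost in the paste above, here is roughly what a fully-formed version of that autodiscover block would look like. The `kubernetes.pod.name` field and the container-id variable in the path are placeholders based on the standard documented layout, not copied from my actual config:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      templates:
        - condition:
            or:
              # one contains clause per pod-name fragment to match;
              # the field/value here is a placeholder
              - contains:
                  kubernetes.pod.name: "ne-ops"
          config:
            - type: container
              paths:
                # standard autodiscover path keyed by container id
                - "/var/log/containers/*-${data.kubernetes.container.id}.log"
              multiline.type: pattern
              multiline.pattern: '^[[:space:]]'
              multiline.negate: false
              multiline.match: after
```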

I want the events from all pods to be collected in NRT. Any suggestions would be appreciated.

Hi @bhavaniprasad_reddy 🙂

Filebeat deals with pods "transparently", no matter whether there are 200 or 2 million. If two out of 240 are "faulty" for whatever reason, there must be something about those two pods that is different from the others.

It may be something as simple as a typo in your YAML (but then you would get no logs at all), or something more complex, like too many logs or logs arriving too fast. This assumes there is no network problem and no lack of CPU/RAM for those pods.

Maybe you can tell us about the size and rate of those logs. Again, I am assuming you are not parsing anything uncommon, like really long lines, for example.

The two pods (out of 240) have been running for more than a day without any issues, and they are able to read from and write to Kafka successfully. I am able to tail the logs and read them without any issue, since not too many lines are being generated.

Please find below some of the metrics that you asked for:

-> log size - 220 MB per 24 hours for pod-1 & 210 MB per 24 hours for pod-2
-> log rate - on average, 30 lines/s for pod-1 & 2 lines/s for pod-2
-> network issues - since the pods are running without any issue, I assume there are no network-related issues.
-> CPU usage on the node running those 2 pods - Requests: 5420m (33%), Limits: 7530m (47%)
-> Memory usage on the node running those 2 pods - Requests: 10078Mi (31%), Limits: 19326Mi (60%)
-> parsing long lines - the longest line in the log is under 1000 characters, so line length is manageable.

NOTE: I have observed that the harvester is running for those 2 log files, but no events are collected: the offset is stuck at a single value and does not increment even though the files are being updated.

The only issue that I see is that the disk IO utilisation is touching almost 90% on the node where these 2 pods are running. Please find the screenshots for CPU, RAM & disk IO metrics on that node.
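If disk IO contention on that node turns out to be the bottleneck, the per-input read behaviour can also be tuned. A sketch using the standard log/container input options from Filebeat 7.x (values are illustrative assumptions, not tested settings):

```yaml
filebeat.inputs:
  - type: container
    paths:
      - "/var/log/containers/*.log"
    harvester_buffer_size: 65536  # read in larger chunks (default is 16384 bytes)
    backoff: 1s                   # wait this long after reaching EOF before re-reading
    max_backoff: 10s              # cap on the backoff between read attempts
    scan_frequency: 10s           # how often to look for new files to harvest
```

A larger read buffer and a longer backoff reduce the number of small reads per file, which can help when many harvesters compete for the same disk.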

CPU utilization on node - [screenshot]

Memory utilization on node - [screenshot]

Disk IO utilization on node - [screenshot]

Please let me know if I missed anything that you asked for.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.