Filebeat 7.12 is collecting events very slowly

I have deployed Filebeat 7.12 in my cluster to collect events from Kubernetes logs using autodiscover.

The events are collected in NRT (near real time) for all the pods I want, but after a period of time (1 to 2 hours) collection stops for just 2 to 3 pods.

There are around 260 pods, of which 240 are recreated every 10 to 15 minutes. Filebeat harvests the logs and collects and ships the events successfully for those 240 pods, and also for most of the other pods, except for 2 to 3 pods.

The behaviour is the same whether Filebeat sends the events to Logstash or directly to the console. The missing events for those 2 to 3 pods are only collected at the end, once no more of the 240 pods are being generated.

I updated the Filebeat configuration to not collect events from those 240 pods. This time the events were collected for all the pods in NRT.

I tried tweaking many parameters, such as max_procs, close_inactive, ignore_older, output.logstash.workers, output.logstash.bulk_max_size, queue.mem.flush.min_events and queue.mem.flush.timeout, but none of them resolved the issue.
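For reference, the knobs above sit in filebeat.yml roughly as follows. This is only a sketch with illustrative values (taken from the commented-out lines in my config), not a recommendation:

```yaml
# Illustrative values only -- these are the parameters I experimented with.
max_procs: 4                # limit the number of OS threads Filebeat uses

filebeat.inputs:
  - type: container
    close_inactive: 5m      # close a harvester after 5m with no new data
    ignore_older: 10m       # skip files not updated in the last 10m

output.logstash:
  hosts: ["logstash-headless.logging:5044"]
  workers: 16               # parallel connections to Logstash
  bulk_max_size: 1600       # maximum events per batch

queue.mem:
  events: 51200             # internal queue capacity
  flush.min_events: 1600    # flush once this many events are buffered...
  flush.timeout: 1s         # ...or after this long, whichever comes first
```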

Resources Allocated -
RAM - 2 to 4 GB and
CPU - 2 to 4 cores

There are 4 filebeat pods running on each worker node.
CPU metrics - [screenshot]

Memory metrics - [screenshot]
Here is the Filebeat configuration I am using:

```yaml
      - type: kubernetes
        node: ${NODE_NAME}
          - "kube-logs"
          - condition.or:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
              - contains:
         "ne-ops"  # This name will be found in 240 pods
              - type: container
                  - "/var/log/containers/*-${}.log"
                multiline.type: pattern
                multiline.pattern: '^[[:space:]]'
                multiline.negate: false
                multiline.match: after
                #scan_frequency: 1s
                #close_inactive: 5m
                #ignore_older: 10m
  max_procs: 4
  filebeat.shutdown_timeout: 5s
  logging.level: debug
    - drop_event:
           - equals:
               kubernetes.namespace: "kube-system"
           - equals:
               kubernetes.namespace: "default"
           - equals:
               kubernetes.namespace: "logging"
    - fingerprint:
        fields: ["message"]
        target_field: "@metadata._id"
    hosts: ["logstash-headless.logging:5044"]
    #, "logstash-headless.logging:5045"]
    #loadbalance: true
    #workers: 16
    index: filebeat
    pretty: false
    #bulk_max_size: 1600
    #compression_level: 9
  #  events: 51200
  #  flush.min_events: 1600
  #  flush.timeout: 1s
  setup.template.name: "filebeat"
  setup.template.pattern: "filebeat-*"
```
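Since some indentation and keys were lost in the paste above, here is roughly what a fully-formed version of that autodiscover block would look like. The `kubernetes.pod.name` field and the container-id variable in the path are placeholders based on the standard documented layout, not copied from my actual config:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      templates:
        - condition:
            or:
              # one contains clause per pod-name fragment to match;
              # the field/value here is a placeholder
              - contains:
                  kubernetes.pod.name: "ne-ops"
          config:
            - type: container
              paths:
                # standard autodiscover path keyed by container id
                - "/var/log/containers/*-${data.kubernetes.container.id}.log"
              multiline.type: pattern
              multiline.pattern: '^[[:space:]]'
              multiline.negate: false
              multiline.match: after
```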

I want the events from all pods to be collected in NRT. Any suggestions would be appreciated.

Hi @bhavaniprasad_reddy 🙂

Filebeat deals with pods "transparently", no matter whether there are 200 or 2 million. If two out of 240 are "faulty" for whatever reason, there must be something about those two pods that is different from the others.

It may be something as simple as a typo in your YAML (but then you would get no logs at all), or something more complex, like too many logs or logs arriving too fast. This assumes there is no network problem and no lack of CPU/RAM for those pods.

Maybe you can tell us about the size and rate of those logs. Again, I am assuming you are not parsing anything uncommon, like really long lines, for example.

The two pods (out of 240) have been running for more than a day without any issues, and they are able to read from and write to Kafka successfully. I am able to tail the logs and read them without any issue, since not too many lines are being generated.

Please find below some of the metrics that you asked for:

-> log size - 220 MB per 24 hours for pod-1 & 210 MB per 24 hours for pod-2
-> log rate - on average, 30 lines/s for pod-1 & 2 lines/s for pod-2
-> network issues - since the pods are running without any issue, I assume there are no network-related issues.
-> CPU usage on the node running those 2 pods - Requests: 5420m (33%), Limits: 7530m (47%)
-> Memory usage on the node running those 2 pods - Requests: 10078Mi (31%), Limits: 19326Mi (60%)
-> parsing long lines - the longest line in the log is under 1000 characters, so line length is manageable.

NOTE: I have observed that the harvester is running for those 2 log files, but no events are collected: the offset is stuck at a single value and does not increment even though the files are being updated.

The only issue that I see is that the disk IO utilisation is touching almost 90% on the node where these 2 pods are running. Please find the screenshots for CPU, RAM & disk IO metrics on that node.
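If disk IO contention on that node turns out to be the bottleneck, the per-input read behaviour can also be tuned. A sketch using the standard log/container input options from Filebeat 7.x (values are illustrative assumptions, not tested settings):

```yaml
filebeat.inputs:
  - type: container
    paths:
      - "/var/log/containers/*.log"
    harvester_buffer_size: 65536  # read in larger chunks (default is 16384 bytes)
    backoff: 1s                   # wait this long after reaching EOF before re-reading
    max_backoff: 10s              # cap on the backoff between read attempts
    scan_frequency: 10s           # how often to look for new files to harvest
```

A larger read buffer and a longer backoff reduce the number of small reads per file, which can help when many harvesters compete for the same disk.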

CPU utilization on node - [screenshot]

Memory utilization on node - [screenshot]

Disk IO utilization on node - [screenshot]

Please let me know if I missed anything that you asked for.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.