Filebeat daemonset in Kubernetes is slow (or fails) to harvest logs from multiple pods

melkamar · January 6, 2023, 5:56pm

Hi! After a full day of pulling my hair out, I'm giving up and want to ask for help here.

I have a Kubernetes cluster where I am running a Filebeat daemonset. I noticed that the logs from some of the pods go missing sometimes. To isolate the problem I have configured the daemonset to only run a single pod on a particular node.

- The node has 47 pods running.
- Together they produce approx. 4,000 log lines per minute (about 85 lines/pod/minute).
- The Filebeat pod consumes 1200m CPU and 256Mi of memory (which seems excessive, but alright).

When I let Filebeat run and collect logs from all the pods, it keeps missing a lot of them. When I look at the logs of one particular pod, I see this:

- 17:55:56: last up-to-date log entry
- 18:14:30: more logs showed up, but only with timestamps up to 18:03:56 (so the last ~10 minutes of logs were still not ingested)

When I restrict Filebeat to collecting logs from a single namespace only, that pod's logs are shipped immediately and everything is as it should be. This is the Filebeat config, including the namespace selection:

filebeat.autodiscover:
  providers:
  - type: kubernetes
    templates:
    - condition:
        equals:
          kubernetes.namespace: specific-namespace
      config:
        - type: container
          paths:
            - /var/log/containers/*-${{data.kubernetes.container.id}}.log

So this leads me to believe there is a bottleneck somewhere.
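For comparison, here is a minimal sketch of an unrestricted provider that would harvest every container on the node instead of a single namespace. It is an assumption modeled on the hints-based setup in Elastic's reference Kubernetes manifest, not the poster's actual full config, and it assumes a `NODE_NAME` environment variable is injected into the Filebeat pod from `spec.nodeName`.

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      # Only watch pods scheduled on this node (assumes NODE_NAME is set
      # from spec.nodeName via the Downward API in the DaemonSet)
      node: ${NODE_NAME}
      # No namespace condition: every container on the node gets an input
      hints.enabled: true
      hints.default_config:
        type: container
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log
```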

It seems that when I kill the pod and it gets recreated, it “pushes” the queue and some of the logs show up. But then it stalls again.

The Elastic server side is fine: the index receiving these logs has a fairly small indexing rate compared to other indices, which are indexed without any problem. JVM heap usage is about 50% (of 4 GB), and each core of the 6-core CPU is below 25% usage.
The node where Filebeat is running also has resources to spare: about half of its 4 cores and 8 GB of memory are free. The Filebeat pod has no CPU limit.
I don't see anything suspicious in the Filebeat logs; at debug level there's just a lot of noise. I can look for something specific if needed. When the Filebeat pod starts, I see a `Harvester started for paths: [/var/log/containers/*-xxx.log]` message corresponding to the pod I am diagnosing, but its logs still stop showing up afterwards.
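If the debug output is too noisy to read, one option is to keep debug level but restrict it with `logging.selectors`. This is a sketch, assuming the component names below match the ones your Filebeat version prints in its debug lines; verify them against the bracketed names in your own logs before relying on it.

```yaml
logging.level: debug
# Limit debug output to a few components instead of everything.
# These selector names are an assumption; check which ones your
# Filebeat version actually emits in its log lines.
logging.selectors: ["autodiscover", "harvester", "input"]
```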

The k8s cluster is self-managed (MicroK8s); Elastic (and Beats) are version 8.1.3.

Is there anything obvious to check? It seems to me that the volume of logs I want to ingest is really small and should not be a problem to handle. Is there any way to debug where Filebeat is getting stuck?
Is the resource usage (1200m CPU and >256Mi memory) expected? The cluster has 4-core nodes with 8 GB each, and when running the full daemonset, Filebeat consistently takes about 25% of CPU and ~5% of memory, which is a lot just for collecting logs.

Thank you for any help!

stephenb · January 6, 2023, 6:30pm

Hi @melkamar

Hmm, yes, something does not seem right: small EPS, small-to-medium pods...

Just thinking out loud...

Are you monitoring the beats? That can provide some insights...
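As a sketch of what Beat monitoring could look like, internal collection can be switched on in `filebeat.yml` as below. The host URL is hypothetical; Elastic also offers Metricbeat or Elastic Agent based collection as an alternative.

```yaml
# Ship Filebeat's own metrics to Elasticsearch (internal collection)
monitoring.enabled: true
monitoring.elasticsearch:
  # Hypothetical endpoint: replace with your (monitoring) cluster
  hosts: ["https://elasticsearch.example.com:9200"]
```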

Can you share the entire filebeat manifest?

There should be a periodic metrics log line with the number of events ingested, published, acked, queued, etc. What does that look like?
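That metrics line comes from Filebeat's internal metrics logging. The snippet below is a sketch showing those settings at their default values, plus the optional local HTTP stats endpoint; the port is simply the documented default.

```yaml
# Periodic "Non-zero metrics in the last 30s" log lines (defaults shown)
logging.metrics.enabled: true
logging.metrics.period: 30s

# Optional: expose the same counters over a local HTTP endpoint,
# readable with e.g. `curl localhost:5066/stats` from inside the pod
http.enabled: true
http.host: localhost
http.port: 5066
```

Comparing counters such as `libbeat.pipeline.events.published` and `libbeat.output.events.acked` over a few intervals would show whether events are piling up inside Filebeat or are never being read in the first place.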

Just curious: what happens if you remove the resource requests/limits?
(not saying that is the fix, just another data point)
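For reference, the block in question is the container's `resources` section in the DaemonSet. The values below are illustrative (similar to Elastic's reference manifest) and not taken from this thread; deleting the whole `resources` block drops both requests and limits, which is the data point suggested above.

```yaml
# Fragment of the Filebeat DaemonSet spec (not a complete manifest)
spec:
  template:
    spec:
      containers:
        - name: filebeat
          resources:
            # Illustrative values; remove this block to run unconstrained
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
```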

Also curious: are the pods short-lived / constantly being redeployed? (I think there was a bug with that.)

melkamar · January 10, 2023, 12:08pm

Hi, sorry for taking a while to get back to this. I think this thread can be closed. I talked to xeraa on Slack, and one part of the problem (logs not being received) was fixed by upgrading to 8.5.3.
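In a DaemonSet, an upgrade like this usually just means bumping the image tag. A minimal fragment, assuming the container is named `filebeat` and uses Elastic's official image:

```yaml
# Fragment of the Filebeat DaemonSet spec (not a complete manifest)
spec:
  template:
    spec:
      containers:
        - name: filebeat
          # Version reported in this thread as fixing the missing logs
          image: docker.elastic.co/beats/filebeat:8.5.3
```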

The CPU/memory usage was still high, but I narrowed that down to the Filebeat registry growing endlessly when containers are restarted inside a pod. I created a separate issue about that here.

system · February 7, 2023, 2:09pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
