Filebeat daemonset in Kubernetes is slow (or fails) to harvest logs from multiple pods

Hi! After a full day of pulling my hair out I’m giving up and want to ask for help here :pray:

I have a Kubernetes cluster where I am running a Filebeat daemonset. I noticed that the logs from some of the pods go missing sometimes. To isolate the problem I have configured the daemonset to only run a single pod on a particular node.

  • The node has 47 pods running
  • They produce approx. 4000 lines of logs per minute (that’s all together, so about 85 lines/pod/minute)
  • The filebeat pod consumes 1200m CPU and 256Mi Mem (which seems excessive but alright)

When I let filebeat run and collect logs from all the pods, it keeps missing a lot of them. When looking at the logs of a particular pod, I see this:

  • 17:55:56: last up-to-date log entry
  • 18:14:30: more logs showed up, but only timestamps until 18:03:56 (so the last 10 minutes of logs are still not ingested)

When I restrict filebeat to only collect logs from a single namespace, then the pod’s logs are shipped immediately! And everything is as it should be. This is the filebeat config, incl. the namespace selection:

  - type: kubernetes
    - condition:
          kubernetes.namespace: specific-namespace
        - type: container
            - /var/log/containers/*-${{}}.log

So this leads me to believe there is a bottleneck somewhere.

It seems that when I kill the pod and it gets recreated, it “pushes” the queue and some of the logs show up. But then it stalls again.

:white_check_mark: The Elastic server side is fine: the index the logs are shipped to has a pretty small indexing rate compared to others that are being indexed without a problem. The JVM heap usage is about 50% (out of 4G). The 6-core CPU is at less than 25% usage on each core.
:white_check_mark: The node where filebeat is running also has resources to spare, about half of its 4 cores and 8G mem. The Filebeat pod has no CPU limit.
:white_check_mark: I don’t see anything suspicious in the filebeat logs; at debug level there’s just a lot of noise. I can look for something specific if needed. When the filebeat pod starts, I see a "Harvester started for paths: [/var/log/containers/*-xxx.log]" message corresponding to the pod I am diagnosing, but the logs still stop showing up afterwards.

The k8s cluster is self-managed (microk8s), Elastic (and beats) version 8.1.3.

:grey_question: Is there anything obvious to check? It seems to me that the amount of logs I want to ingest is really small and should not be a problem to handle. Any way to debug where filebeat is hanging up?
:grey_question: Is the resource usage (1200m CPU and >256Mi Mem) expected? The cluster has 4-core-nodes with 8GB each and when running the full daemonset, this consistently takes away about 25% CPU and ~5% Mem which is a lot for just collecting logs.

Thank you for any help!

Hi @melkamar

Hmmm yes, something does not seem right, small EPS, small to med Pods...

Just thinking out loud...

Are you monitoring the beats? That can provide some insights...

Can you share the entire filebeat manifest?

There should be a periodic metrics log line with the number of events ingested, published, acked, queued, etc. What does that look like?
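In case it helps, here is a minimal sketch of the settings that control that periodic metrics line, plus the optional local stats endpoint. The values shown are illustrative assumptions, not taken from your manifest.

```yaml
# filebeat.yml fragment (sketch; values are illustrative assumptions)
logging.level: info
logging.metrics.enabled: true   # emit the periodic internal-metrics log line
logging.metrics.period: 30s     # how often that line is written

# Optional: expose internal stats over a local HTTP endpoint
http.enabled: true
http.host: localhost
http.port: 5066
```

With the endpoint enabled, running `curl localhost:5066/stats` inside the Filebeat pod should return the same counters as JSON. Comparing the published and acked event counts over a few periods would show whether events are stuck in the queue or never being read from the files.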

Just curious: if you remove the resource usage/limits, what happens?
(not saying that is the fix, just another data point)

Also curious: are the pods "short-lived" / constantly being deployed? (I think there was a bug with that.)

Hi, sorry for taking a while to get back to this. I think this thread can be closed. I talked to xeraa on Slack, and one part of the problem (the missing logs) was fixed by upgrading to 8.5.3.

The CPU/Mem usage was still high, but I managed to narrow that down to the Filebeat registry growing endlessly when containers are restarted inside a pod. I created a separate issue about that here.