Hi! After a full day of pulling my hair out, I’m giving up and asking for help here.
I have a Kubernetes cluster where I am running a Filebeat daemonset. I noticed that logs from some of the pods sometimes go missing. To isolate the problem, I have configured the daemonset to run only a single Filebeat pod on one particular node.
- The node has 47 pods running.
- Together they produce approx. 4000 lines of logs per minute (so about 85 lines/pod/minute).
- The filebeat pod consumes 1200m CPU and 256Mi memory (which seems excessive, but alright).
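For context, this is roughly how I pinned the daemonset to that one node, just a nodeSelector in the pod template (the hostname value below is a placeholder for my actual node):

spec:
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: my-node   # placeholder; the real node's hostname label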
When I let filebeat run and collect logs from all the pods, it keeps missing a lot of them. When looking at the logs of a particular pod, I see this:
- 17:55:56: last up-to-date log entry
- 18:14:30: more logs showed up, but only with timestamps up to 18:03:56 (so the last 10 minutes of logs are still not ingested)
When I restrict filebeat to collecting logs from only a single namespace, the pod’s logs are shipped immediately and everything is as it should be! This is the filebeat config, including the namespace selection:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.namespace: specific-namespace
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
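For comparison, the problematic all-namespaces setup is, to my understanding, the same template minus the condition block, i.e. roughly:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log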
So this leads me to believe there is a bottleneck somewhere.
It seems that when I kill the filebeat pod and it gets recreated, it “pushes” the queue and some of the missing logs show up, but then it stalls again.
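I haven’t touched the internal queue at all, so it is running with defaults. Would it be worth experimenting with the memory queue settings? Something like this is what I’d try first (the numbers are just guesses):

queue.mem:
  events: 8192           # default is 4096 as far as I know
  flush.min_events: 512  # flush smaller batches instead of waiting for 2048
  flush.timeout: 1s      # force a flush at least once per second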
The Elasticsearch side looks fine: the index the logs are shipped to has a pretty small indexing rate compared to other indices, which are being indexed without a problem. JVM heap usage is about 50% (of 4G), and each of the 6 CPU cores is at less than 25% usage.
The node where filebeat is running also has resources to spare: roughly half of its 4 cores and 8G of memory are free. The filebeat pod has no CPU limit.
I don’t see anything suspicious in the filebeat logs; at debug level there’s just a lot of noise. I can look for something specific if needed. When the filebeat pod starts, I see a “Harvester started for paths: [/var/log/containers/*-xxx.log]” message corresponding to the pod I am diagnosing, but its logs still stop showing up afterwards.
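To cut the debug noise down, I could also narrow the logging to the relevant subsystems, something like this (I’m guessing at which selectors are the useful ones):

logging.level: debug
logging.selectors: ["harvester", "input", "kubernetes", "autodiscover"]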
The k8s cluster is self-managed (microk8s); Elastic (and Beats) version is 8.1.3.
Is there anything obvious to check? The volume of logs I want to ingest seems really small and should not be a problem to handle. Is there any way to debug where filebeat is hanging?
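If it helps, I can enable filebeat’s HTTP metrics endpoint and post the pipeline/output numbers from it, i.e. roughly this in the config:

http.enabled: true
http.host: localhost
http.port: 5066   # the default metrics port, if I recall correctly

and then curl localhost:5066/stats from inside the pod to see where events pile up.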
Also: is the resource usage (1200m CPU and >256Mi memory) expected? The cluster has 4-core nodes with 8GB each, and when the full daemonset is running, filebeat consistently takes about 25% of a node’s CPU and ~5% of its memory, which seems like a lot for just collecting logs.
Thank you for any help!