Hello,
Some information about the system, as it may help:
- Installation inside K8s using the Elastic Helm charts.
- Filebeat version 7.16.3
- Number of pods per node: ~160
- Number of log files harvested per node: ~100
- CPU usage reported by K8s: around 1858m (limit at 2000m)
- Mix of Kubernetes autodiscover (90% of the logs) and Docker autodiscover (for kubelet and some other components)
- Some of the pods update one of their annotations quite often (PostgreSQL pods deployed via Zalando's postgresql-operator).
I spent most of Friday trying to understand why, in one of our clusters, we had an abnormal load on these pods (which resulted in a high delay on log ingestion).
From all my checks, this seems to be related to the Filebeat registry, although I am not sure why. What makes me believe it is the registry is the following test:
- Spawned a Debian pod
- Copied the Filebeat executable into it
- Started Filebeat without its registry and configured it to simply output the logs to a file.
- Once it had caught up to the same point as the DaemonSet pod on that node, I switched the output to send to the same Elasticsearch as the DaemonSet pod.
CPU of that pod was around 300m, while CPU of the DaemonSet pod was around 1800m... roughly 6 times more, for the same logs shipped.
I then copied the registry of the DaemonSet pod into my test pod, restarted Filebeat, and the load went from 300m to 1800m.
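For reference, this is roughly how I ran the standalone test (the pod/namespace names and paths below are illustrative, not the exact ones I used):

# assuming a Debian pod that already mounts /var/log from the host, like the DaemonSet does
kubectl cp ./filebeat my-namespace/filebeat-test:/tmp/filebeat
kubectl cp ./filebeat-test.yml my-namespace/filebeat-test:/tmp/filebeat-test.yml

# first run: empty registry (fresh --path.data), file output enabled in filebeat-test.yml
kubectl exec -it -n my-namespace filebeat-test -- \
  /tmp/filebeat -e -c /tmp/filebeat-test.yml --path.data /tmp/filebeat-data

# to reproduce the high load: copy the DaemonSet registry into that data path and restart
kubectl cp my-namespace/filebeat-daemonset-pod:/usr/share/filebeat/data/registry /tmp/registry
kubectl cp /tmp/registry my-namespace/filebeat-test:/tmp/filebeat-data/registry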
For the impacted registry, here is the content of the data directory:
total 8
-rw------- 1 root root 0 Feb 28 12:14 filebeat.lock
-rw------- 1 root root 95 Nov 3 14:00 meta.json
drwxr-x--- 3 root root 4096 Nov 3 14:00 registry
./registry:
total 4
drwxr-x--- 2 root root 4096 Feb 28 15:43 filebeat
./registry/filebeat:
total 16988
-rw------- 1 root root 4603323 Feb 28 15:43 206089742237.json
-rw------- 1 root root 60 Feb 28 15:43 active.dat
-rw------- 1 root root 2244608 Feb 28 15:43 checkpoint.new
-rw------- 1 root root 10533868 Feb 28 15:43 log.json
-rw------- 1 root root 15 Nov 3 14:00 meta.json
The 206089742237.json file seems to be around 4.4M and contains ~40k lines... I don't really understand why, as the number of active and existing log files was no more than 130 last time I checked. I also looked at the content of the log.json file, and a lot of the entries point to pods that have not existed for a long time, sometimes with a negative TTL.
Example here:
{"k":"filebeat::logs::native::3285809-66306","v":{"prev_id":"","offset":37439,"timestamp":[134818665,1636305448],"ttl":-2,"identifier_name":"native","id":"native::3285809-66306","source":"/var/log/containers/apps-deployment-5c77657d-ts9z4_staging-red-2021-06_apps-80c8db8ee89ff8be033d5589e788df96fa7c819669facb7fc62ca502af42b11d.log","type":"container","FileStateOS":{"inode":3285809,"device":66306}}}
{"op":"set","id":206094215491}
Taking a pprof CPU profile seems to show a lot of time spent in the filepath.matchChunk function, which makes me believe that Filebeat keeps checking all of these stale files...
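In case it helps, this is roughly how I collected the profile (the --httpprof flag is from memory, so double-check it; <pod-ip> is a placeholder):

# run Filebeat with the profiling HTTP endpoint enabled
filebeat -e -c filebeat.yml --httpprof 0.0.0.0:6060

# then, from a machine with Go installed, take a 30s CPU profile
go tool pprof http://<pod-ip>:6060/debug/pprof/profile?seconds=30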
The documentation mentions that clean_removed is enabled by default; I checked my configuration and we do not override that option at all. The logs also don't show any cleaning errors...
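This is how I checked the rendered configuration for any override (the namespace and config path are the ones from our chart values, so adjust as needed):

# make sure no clean_*/ignore_older override is present in the config actually used by the pod
kubectl exec -n logging ds/filebeat -- \
  grep -nE "clean_(inactive|removed)|ignore_older" /usr/share/filebeat/filebeat.yml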
Last note: even though the entry shown above as an example is not one of them, running Filebeat in debug mode shows a lot of reloads caused by our databases updating their annotations and triggering the "update" event of the autodiscover provider. This is amplified because, from what I noticed, an event is sent for the same pod for each port of each container (not too sure why?). Removing the databases from the discovery only slightly reduces the CPU usage (1800m > 1600m), so it doesn't seem to be the main factor.
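For reference, the annotation churn can also be seen from the Kubernetes side (the pod name and namespace below are placeholders for one of the database pods):

# print the pod's annotations every time its object changes
kubectl get pod my-db-pod-0 -n my-db-namespace -o json --watch \
  | jq -c '.metadata.annotations'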
Could you please help us understand what is going on in that cluster, and what can be done to reduce the CPU load? Wiping the registry would probably help, I guess, but it would mean duplicating our logs and might not work in the long run...
Kr,
Pierre