Background: we are running Filebeat as a DaemonSet in a self-managed Microk8s Kubernetes cluster. I noticed extreme CPU usage and failures and delays in delivering logs after a while. More details in this Slack thread, but the tl;dr is that there were far too many dead file references in the registry (over 60k). Deleting the registry and starting from scratch fixed all the problems we were having with harvesting logs.
I went to check why this happened in the first place, and it looks like Filebeat doesn't handle container restarts well. Our workloads restart every 60 minutes by design, and every restart seems to create a stale reference in the registry. These stale references never get removed.
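(A rough way to count the dead entries, as a sketch: it assumes each registry entry stores the harvested path under a "source" key, which is what our registry files contain:)
# extract every registered source path and count those that no longer exist on disk
grep -ho '"source":[^,]*' /var/lib/filebeat-data/registry/filebeat/*.json \
  | cut -d'"' -f4 | sort -u \
  | while read -r f; do [ -e "$f" ] || echo "dead: $f"; done \
  | wc -l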
I was able to reproduce this with a very simple Deployment whose container prints the date a few times and then exits with a non-zero code, so Kubernetes keeps restarting it:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-melka
spec:
  selector:
    matchLabels:
      melka: xo
  template:
    metadata:
      labels:
        melka: xo
    spec:
      containers:
      - name: test
        image: bash
        command:
        - bash
        - -c
        - for i in $(seq 1 15); do date; sleep 1; done; exit 1
      nodeSelector:
        kubernetes.io/hostname: my.node.com
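To drive the repro, apply the manifest and watch the restart counter climb (the file name is arbitrary; the label comes from the manifest above):
kubectl apply -f deployment-melka.yaml
# RESTARTS increases on every crash (CrashLoopBackOff slows the cycle down over time)
kubectl get pods -l melka=xo -w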
The Filebeat DaemonSet is set up according to the official documentation, with the following Filebeat config:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.namespace: debugging-namespace
          config:
            type: container
            ignore_older: 2h
            scan_frequency: 60s
            close_inactive: 5m
            clean_inactive: 3h
            clean_removed: true
            paths:
              - /var/log/containers/*-${data.kubernetes.container.id}.log

processors:
  - add_host_metadata:

output.elasticsearch:
  hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
  username: ${ELASTICSEARCH_USERNAME}
  password: ${ELASTICSEARCH_PASSWORD}
  index: "idx-%{[agent.version]}"
(I tried various values for the scan, close, and clean intervals, but the default configuration resulted in the same problem.)
Looking at the number of lines matching the pod/deployment in the Filebeat registry:
cat /var/lib/filebeat-data/registry/filebeat/5*.json | grep deployment-melka | wc -l
I see that this number is always one greater than the pod's restart count shown by
kubectl get pods
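(The restart count can also be read directly; a sketch assuming the label from the repro manifest above:)
kubectl get pods -l melka=xo \
  -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'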
Looking at the files mounted into the Filebeat pod, all is as expected:
$ ls -l /var/log/containers/ | grep deployment-melka
lrwxrwxrwx 1 root root 122 Jan 10 11:57 deployment-melka-76faddxxx.log -> /var/log/pods/debugging-namespace_deployment-melka-76faddxxxx/test/18.log
The actual rotated log files (17.log, 16.log, ...) are removed by microk8s when the container restarts. The name of the symlink changes as well: each restart creates a different symlink.
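A quick check from inside the Filebeat pod confirms that only the newest rotation exists and that no symlink is left dangling (a sketch using the mount paths shown above):
# only the current rotation (18.log here) remains; 17.log, 16.log, ... are gone
ls /var/log/pods/debugging-namespace_deployment-melka-*/test/
# verify no /var/log/containers symlink is dangling; the stale state lives only in the registry
for l in /var/log/containers/*deployment-melka*.log; do
  readlink -e "$l" >/dev/null || echo "dangling: $l"
done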
Is there some configuration I am missing, or is this a bug? The problem does not reproduce when the whole pod is recreated "gracefully", e.g. with
kubectl scale deploy deployment-melka --replicas 0
kubectl scale deploy deployment-melka --replicas 1
It reproduces only when the container inside the pod restarts in place.
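For completeness, the temporary mitigation mentioned at the top, as a rough sketch (the k8s-app=filebeat label and kube-system namespace come from the official manifest; adjust to your deployment):
# on each node: wipe the registry (loses read offsets; logs still on disk get re-shipped)
rm -rf /var/lib/filebeat-data/registry
# then recreate the Filebeat pods so they start with a fresh registry
kubectl -n kube-system delete pod -l k8s-app=filebeat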