Log files are not cleared from the registry when a harvested Kubernetes container is restarted

Background: we are running Filebeat as a DaemonSet in a self-managed MicroK8s Kubernetes cluster. After a while I noticed extreme CPU usage and failures/delays in delivering logs. More details are in this Slack thread, but the tl;dr is that there were far too many dead file references in the registry file (over 60k). Deleting the registry and starting from scratch fixed all the problems we were having with harvesting logs.
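For reference, the reset amounted to roughly the following (a sketch only; the registry path matches the hostPath shown further down, and the DaemonSet name/namespace are assumptions based on the official filebeat-kubernetes.yaml manifest):

kubectl -n kube-system delete daemonset filebeat     # stop Filebeat so nothing writes to the registry
rm -rf /var/lib/filebeat-data/registry               # run on the node, inside the hostPath volume
kubectl apply -f filebeat-kubernetes.yaml            # redeploy; Filebeat starts with an empty registry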

I went to check why this happened in the first place, and it looks like Filebeat doesn't handle container restarts well. Our workloads restart every 60 minutes by design, and every restart seems to leave a stale reference in the registry. These references never get removed.

I was able to reproduce this with a very simple Deployment resource which simply prints something a few times and then exits/restarts:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-melka
spec:
  selector:
    matchLabels:
      melka: xo
  template:
    metadata:
      labels:
        melka: xo
    spec:
      containers:
      - name: test
        image: bash
        command:
        - bash
        - -c
        - for i in $(seq 1 15); do date; sleep 1; done; exit 1
      nodeSelector:
        kubernetes.io/hostname: my.node.com

The Filebeat DaemonSet setup follows the official documentation, with the Filebeat config being:

filebeat.autodiscover:
  providers:
  - type: kubernetes
    templates:
    - condition:
        equals:
          kubernetes.namespace: debugging-namespace
      config:
        type: container
        ignore_older: 2h
        scan_frequency: 60s
        close_inactive: 5m
        clean_inactive: 3h
        clean_removed: true
        paths:
          - /var/log/containers/*-${data.kubernetes.container.id}.log

processors:
  - add_host_metadata:

output.elasticsearch:
  hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
  username: ${ELASTICSEARCH_USERNAME}
  password: ${ELASTICSEARCH_PASSWORD}

  index: "idx-%{[agent.version]}"

(I tried various values for the scanning and closing intervals, etc., but the default config resulted in the same problem.)

Looking at the number of lines matching the pod/deployment in the Filebeat registry:

cat /var/lib/filebeat-data/registry/filebeat/5*.json | grep deployment-melka | wc -l

I can see that the number is always one greater than the pod's restart count shown by

kubectl get pods
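In other words, the comparison is roughly the following (namespace and label taken from the manifests above; treat them as assumptions, since they may differ in your setup):

# registry entries referencing the test deployment
cat /var/lib/filebeat-data/registry/filebeat/5*.json | grep deployment-melka | wc -l

# restart count of the pod created by the Deployment
kubectl -n debugging-namespace get pods -l melka=xo \
  -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'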

Looking at the files mounted into the Filebeat pod, all is as expected:

$ ls -l /var/log/containers/ | grep deployment-melka
lrwxrwxrwx  1 root root     122 Jan 10 11:57 deployment-melka-76faddxxx.log -> /var/log/pods/debugging-namespace_deployment-melka-76faddxxxx/test/18.log

The actual log files (17, 16, ...) are removed by MicroK8s when the container restarts. The name of the symlink also changes: each restart creates a different symlink.
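One quick way to watch this happening (just standard tools, nothing Filebeat-specific) is to keep listing the symlink while the container restarts:

# on the node: the symlink name and its target change on every container restart
watch -n 5 'ls -l /var/log/containers/ | grep deployment-melka'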

Is there some configuration I am missing, or is this a bug? This does not reproduce when the whole pod is recreated "gracefully", e.g. with

kubectl scale deploy deployment-melka --replicas 0
kubectl scale deploy deployment-melka --replicas 1

only when the container inside a pod restarts.

Do you see duplicate entries in the registry, maybe with different offsets, when the container restarts?

There is of course a difference in log file handling when a pod is terminated/evicted/killed, and that is one possibility, but using the close_* and clean_* settings should keep the registry housekeeping in good shape.

No, they are not duplicates:

$ cat <registry>.json | jq '.[].source'
/var/log/containers/deployment-melka-76faddbb-54ff98b96c-97ghs_elastic-beats_test-d17186be24fff22012415147131bfe58d07dd718c20db83938cbe6e04d7cd3e4.log
/var/log/containers/deployment-melka-76faddbb-54ff98b96c-97ghs_elastic-beats_test-d4b24a566879a7021f3ee963ac3069443fe8b4ca221926917d761e7671f6d497.log
(...)

$ cat <registry>.json | jq '.[].source' | wc -l
      45
$ cat <registry>.json | jq '.[].source' | sort -u | wc -l
      45

All the entries in the registry are unique (both the "source" and "id" fields differ).

Weird, in my config I had:

close_*: 1h
clean_*: 73h
ignore_older: 72h

and my registry was clean and entries were being removed as per the configuration.
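Spelled out (treating the wildcards as shorthand for the individual log-input options rather than literal config keys), that corresponds to something like:

close_inactive: 1h
ignore_older: 72h
clean_inactive: 73h
clean_removed: true

which also satisfies the documented requirement that clean_inactive be greater than ignore_older + scan_frequency.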

Indeed, the best option to ensure the registry stays 'clean' is to set the clean_* options. The container input uses the log input under the hood, and the registry file growing too large and causing performance issues are known problems with the log input.

We're working on migrating the log input to filestream, but it will take a while.
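For anyone who wants to experiment already, a rough, untested sketch of the same autodiscover template on top of filestream could look like this (option names taken from the filestream docs; the id value is my own choice, since filestream requires a unique id per input):

      config:
        type: filestream
        id: container-${data.kubernetes.container.id}
        paths:
          - /var/log/containers/*-${data.kubernetes.container.id}.log
        parsers:
          - container: ~
        prospector.scanner.symlinks: true
        clean_removed: true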


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.