Losing the final lines of a terminating / crashing container

Hello,

We are trying to set up Elastic Cloud on Kubernetes (ECK) 1.4, using Filebeat 7.11.1 to harvest the logs of all containers running on our Kubernetes cluster.

Since annotating every Pod we could potentially be interested in looked more difficult, we are using autodiscover. Everything works perfectly, except that Filebeat loses the last few lines when a Pod is terminated gracefully or when a Pod crashes.

It doesn't always happen: sometimes logs are collected correctly even on a full crash of the container. After some testing, I think it is related to the timing of writes to the json-file relative to when the container actually terminates: if the log lines are flushed to the json-file late and the container stops almost immediately afterwards, Filebeat loses those lines.

I've searched around and read that there are a few known problems related to Docker and Kubernetes events combined with autodiscover, but I couldn't find a proper solution.

Here is the Filebeat definition:

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: elasticsearch
  namespace: elastic-system
spec:
  type: filebeat
  version: 7.11.1
  elasticsearchRef:
    name: elasticsearch
  kibanaRef:
    name: elasticsearch
  config:
    setup.template.enabled: true
    setup.template.name: "filebeat"
    setup.template.overwrite: true
    setup.template.settings:
      _source.enabled: true
      index.number_of_shards: 5
      index.number_of_replicas: 2
    filebeat:
      autodiscover:
        providers:
        - type: kubernetes
          node: ${NODE_NAME}
          cleanup_timeout: 60
          hints.enabled: true
          # hints.default_config:
            # type: container
            # paths:
            # - /var/lib/docker/containers/${data.container.id}/*.log
            # - /var/log/pods/${data.kubernetes.pod.uid}/${data.kubernetes.container.name}/*.log
            # multiline.pattern: '^[[:space:]]+(at|\.{3})[[:space:]]+\b|^Caused by:'
            # multiline.negate: false
            # multiline.match: after
            # json.message_key: log
    processors:
    - add_host_metadata:
        netinfo.enabled: true
    - add_kubernetes_metadata:
        in_cluster: true
    - add_process_metadata:
        match_pids: [system.process.ppid]
        target: system.process.parent
    - drop_event:
        when:
          or:
            - equals: # ignore itself
                kubernetes.container.name: "filebeat"
            - equals: # ignore metallb objects
                kubernetes.namespace: "metallb-system"
            - equals: # ignore argocd objects
                kubernetes.namespace: "argocd"
            - equals: # ignore lens-metrics objects
                kubernetes.namespace: "lens-metrics"
            - equals: # ignore Percona haproxy
                kubernetes.container.name: "haproxy"
            - equals: # ignore Rook-Ceph csi-snapshotter
                kubernetes.container.name: "csi-snapshotter"
            - equals: # ignore Kubernetes coredns
                kubernetes.container.name: "coredns"
            - equals: # ignore Vasco simulator
                kubernetes.container.name: "amis-vasco-simulator"
            - regexp: # ignore debug or trace logs
                message: "(DBG|DEBUG|TRACE|debug|trace)"
            - regexp: # ignore empty lines
                message: "^$"
            - regexp: # ignore debug or trace logs
                message: "<(Trace|Debug)>"
            - contains: # ignore probes
                message: "Health check succeeded"
            - contains: # ignore probes
                message: "kube-probe"
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        terminationGracePeriodSeconds: 30
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: false # setting this to true would provide richer host metadata
        containers:
        - name: filebeat
          securityContext:
            runAsUser: 0
          volumeMounts:
          - name: varlibdockercontainers
            mountPath: /var/lib/docker/containers
            readOnly: true
          - name: varlog
            mountPath: /var/log
            readOnly: true
          # persisted path on container. This is expected to be persisted between runs
          - name: data
            mountPath: /usr/share/filebeat/data
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: 200m
              memory: 400Mi
            requests:
              cpu: 100m
              memory: 200Mi
        volumes:
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: varlog
          hostPath:
            path: /var/log
        # persisted path on host will be mounted under persisted path of the container
        - name: data
          hostPath:
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate

I've tried various configurations of hints.default_config, but none of them changed anything regarding this problem.
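For reference, here is a sketch of the kind of variation I've been trying. The close_removed / clean_removed / close_timeout settings are log-input options which, as far as I understand, the container input also accepts; the values below are illustrative, not from my actual setup:

```yaml
hints.default_config:
  type: container
  paths:
  - /var/lib/docker/containers/${data.container.id}/*.log
  # Keep harvesting the file even after it is removed from disk,
  # until close_timeout expires
  close_removed: false
  clean_removed: false
  # Give the harvester extra time to pick up lines flushed
  # just before the container exits
  close_timeout: 5m
```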

One thing I've noticed is that the json-file log of a terminating container is removed as soon as the container terminates, which I imagine is the main culprit of why logs aren't harvested for terminating containers.

For crashing containers instead, the json-file log remains (it can still be consulted with --previous on kubectl logs), but the last lines are still not harvested if they are written too late in the log file.

Am I configuring something wrong? Are there any tips on how things should be set up to prevent this kind of problem?

Hi!

It looks similar to [filebeat] Sometimes Pod logs are not collected by Filebeat · Issue #17396 · elastic/beats · GitHub, which was fixed by Fix terminating pod autodiscover issue by ChrsMark · Pull Request #20084 · elastic/beats · GitHub (version 7.11 includes this fix). Can you check, similarly to the GitHub issue, whether "the harvester is terminated after the pod has finished completely"?

Also, I wonder if your targeted Pods are in a state that is unhandled by autodiscover (similar to what was happening in the other issue).

Last but not least, it would be nice if you could provide a reproduction scenario similar to the one in the aforementioned GitHub issue, so that we can try to reproduce and debug it.
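If it helps, a minimal reproduction could be a Pod that writes a burst of lines and then exits immediately, so the json-file is flushed right before the container terminates. The name and line count below are arbitrary:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fast-exit-logger   # arbitrary name for the test
spec:
  restartPolicy: Never
  containers:
  - name: logger
    image: busybox
    # Print 100 numbered lines and exit immediately afterwards
    command: ["sh", "-c", "i=1; while [ $i -le 100 ]; do echo \"line $i\"; i=$((i+1)); done"]
```

If the last lines (e.g. "line 100") never reach Elasticsearch while earlier ones do, that would match the behaviour you describe.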