Filebeat daemonset losing messages


Roughly this question has been asked a few times over the years, but I've yet to find a real solution.

We've got filebeat deployed as a daemonset in an on-prem k8s cluster.

The config is very simple.

filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true

processors:
  - add_kubernetes_metadata:
      in_cluster: true
      labels.dedot: true
      annotations.dedot: true

  - add_fields:
      target: kubernetes
      fields:
        cluster: ${KUBERNETES_CLUSTER}


The problem appears to be twofold:

  1. filebeat isn't fast enough to process all the logs.
  2. filebeat is closing files when they are removed.

We're sending ~5k HTTP requests per second to 3 pods across 3 k8s nodes (roughly 1,600 requests per second per pod) for 10 minutes, so about 3 million requests in total. Only about 2.3 million end up viewable in Kibana.
1,600 per second is the most the service pods can handle in this load test. We've run the same test at least 5 times with the same result.
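As a quick sanity check on those numbers, the arithmetic can be written out as a minimal sketch. The figures are the approximate ones quoted above; the function name `loss_percent` is only illustrative.

```python
def loss_percent(expected: int, indexed: int) -> float:
    """Return the percentage of expected events that never reached the index."""
    return 100.0 * (expected - indexed) / expected


requests_per_second = 5_000   # approximate aggregate rate across the 3 pods
duration_seconds = 10 * 60    # 10 minute load test
expected = requests_per_second * duration_seconds  # about 3 million requests
indexed = 2_300_000           # approximate count visible in Kibana

print(f"expected={expected} loss={loss_percent(expected, indexed):.1f}%")
```

With these rounded inputs the loss works out to a little over 23%.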

To expand on the points above: it appears that filebeat can't keep up with the rate at which messages are generated, and that it then closes the log files once something (docker?) rotates and deletes them.

2022-05-24T13:53:02.904Z	INFO	[input.harvester]	log/harvester.go:332	File was removed. Closing because close_removed is enabled.	{"input_id": "f6919c35-ef1c-4970-89f8-f2a8e5d25b2f", "source": "/var/lib/docker/containers/220fb0a62a25377d0d698e9fd23bd8c1712064d691cc85031a9cd22fca9f5a3f/220fb0a62a25377d0d698e9fd23bd8c1712064d691cc85031a9cd22fca9f5a3f-json.log", "state_id": "native::2760806-2064", "finished": false, "os_id": "2760806-2064", "old_source": "/var/lib/docker/containers/220fb0a62a25377d0d698e9fd23bd8c1712064d691cc85031a9cd22fca9f5a3f/220fb0a62a25377d0d698e9fd23bd8c1712064d691cc85031a9cd22fca9f5a3f-json.log", "old_finished": true, "old_os_id": "2760806-2064", "harvester_id": "22cbe234-9f52-46da-bd86-859092a75748"}

220fb0a62a25377d0d698e9fd23bd8c1712064d691cc85031a9cd22fca9f5a3f is the ID of the container serving the HTTP requests on this node.

I'd like to set close_removed: false, but I don't know where to set it, as I'm not defining a log input under filebeat.inputs.

In earlier testing we were running at about 15k HTTP requests per second across the 3 pods, and the approximate percentage of loss was the same: very roughly 25-30%.
So my earlier statement that filebeat isn't fast enough was inaccurate. We know it can handle something like 11k messages per second (about 70-75% of 15k), so why it then can't handle 5k messages per second is awfully confusing.

Neither the k8s nodes nor the Elasticsearch cluster is struggling.

We do have the filebeat metrics available, and filebeat logs its "Non-zero metrics in the last 30s" lines, but I don't know what to make of the stats.
For example, for a pod that has been running for 49 minutes, is this bad?

$ kubectl exec filebeat-filebeat-9bbr7 -n infra-monitoring -- curl -s -XGET 'localhost:5066/stats?pretty' | jq .beat.cgroup.cpu.stats
{
  "periods": 27352,
  "throttled": {
    "ns": 839394825410,
    "periods": 3491
  }
}

The periods counter grows quickly while the load test is running, and the throttled numbers sound bad.
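One way to read those counters, as a rough sketch: it assumes they mirror the kernel's CFS bandwidth statistics, where `periods` is the number of enforcement periods that have elapsed and `throttled.periods` is how many of those ended with the container hitting its CPU limit. The helper name `throttle_summary` is hypothetical, and the throttled time may be accumulated across CPUs, so treat it as indicative only.

```python
from typing import Tuple

# Values copied from the /stats output above.
stats = {"periods": 27352, "throttled": {"ns": 839394825410, "periods": 3491}}


def throttle_summary(cpu_stats: dict) -> Tuple[float, float]:
    """Return (percent of periods throttled, total throttled time in seconds)."""
    throttled_pct = 100.0 * cpu_stats["throttled"]["periods"] / cpu_stats["periods"]
    throttled_seconds = cpu_stats["throttled"]["ns"] / 1e9
    return throttled_pct, throttled_seconds


pct, seconds = throttle_summary(stats)
print(f"throttled in {pct:.1f}% of periods, {seconds:.0f}s throttled in total")
```

For this pod that is about 12.8% of periods and roughly 839 seconds of throttled time, which would be consistent with filebeat being held back by its CPU limit rather than by the node or by Elasticsearch.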

Any help would be appreciated.

By changing the filebeat config and resource limits as described below, I was able to get the difference between the requests reportedly made and the requests indexed in Elasticsearch down to 0.38%.

filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      appenders:
        - type: config
          config:
            close_removed: false
            clean_removed: false

I also increased the CPU limit on the filebeat containers to 3 (3000m), although 2 would probably have been enough.

At this point I suspect the remaining loss is in either Kubernetes or Docker.
