Filebeat makes too many API calls, choking and bringing down the Kubernetes cluster

Hi,

We have Filebeat 7.12.0 running as a DaemonSet on all our production K8s clusters. Recently, we faced a catastrophic situation where a whole K8s cluster with 300+ nodes was brought down by Filebeat.

We have configured Filebeat with around 35 different input.d configurations, most of them either type: container or type: log.

In one of those 35 input.d configs, we set a wrong regex in exclude_files, which started to throw errors like:

Error creating runner from config: error parsing regexp: missing argument to repetition operator: `*` accessing
 'exclude_files.0' (source:'/opt/filebeat/etc/inputs.d/abc.yml')

The configuration for that particular prospector was:

- type: log
  paths:
    - /var/log/containers/*_abc_*.log
  symlinks: true
  exclude_files: ["*xyz*"] // MALFORMED REGEX

  json.keys_under_root: false
  
  fields:
    topic: "abc"

  fields_under_root: true
  processors:
    - add_kubernetes_metadata:
        in_cluster: true
        default_matchers.enabled: false
        matchers:
        - logs_path:
            logs_path: /var/log/containers/

This bad regex configuration triggered such a huge number of API calls to kube-apiserver for the Kubernetes metadata fetch that it brought down the master node itself, which eventually took down the whole Kubernetes cluster.
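For reference, the exclusion we actually intended can be written as a valid regex, e.g. (just a sketch; xyz stands in for the real substring we wanted to exclude):

  exclude_files: ['.*xyz.*']   # valid regex; or simply ['xyz'], since the pattern is matched anywhere in the path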

How can we make Filebeat handle these kinds of misconfigurations more gracefully, so that it doesn't bring down the entire K8s cluster?

What are the general suggestions on the usage of add_kubernetes_metadata? I think that if I combine all the different input.d configurations into a single one, with just one add_kubernetes_metadata config, the number of requests to the K8s API will drop significantly and may keep the cluster from being brought down. (Please let me know if that will help.)

Attaching some images to give a sense of the increase in the number of K8s API requests.

Hi @ayush_mundra !
What happened here is interesting, but it's hard to spot the root cause from the description alone. In general, if you define the processor for each of the inputs, then you will have a separate processor instantiated per input, each with its own cache. This can indeed lead to more load compared to having one processor at Filebeat's level. Hope it helps a little bit :)!
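For illustration, having the processor once at Filebeat's level would look roughly like this (only a sketch; the inputs.d path is taken from your error message, and the per-input add_kubernetes_metadata blocks would then be removed):

filebeat.config.inputs:
  enabled: true
  path: /opt/filebeat/etc/inputs.d/*.yml

processors:
  - add_kubernetes_metadata:
      in_cluster: true
      default_matchers.enabled: false
      matchers:
        - logs_path:
            logs_path: /var/log/containers/

That way a single processor instance (with a single cache) talks to the API server, instead of one per input.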

C.

Hi @ChrsMark
Thanks for the reply!

What happened here is interesting, but it's hard to spot the root cause from the description alone.

Please let me know how I can elaborate further, and what kind of information you need from my end.

In general, if you define the processor for each of the inputs, then you will have a separate processor instantiated per input, each with its own cache.

Yes, that's correct, and we actually tried it on our side. The number of requests dropped by roughly 10x, so that is the approach we will have to pursue, with a lot of conditionals and processors.
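Roughly what I mean, on top of a single add_kubernetes_metadata processor at Filebeat's level (just a sketch; the namespaces and topics are placeholders):

processors:
  - add_fields:
      when:
        equals:
          kubernetes.namespace: 'abc'
      target: ''          # empty target to add the field at the event root, like fields_under_root: true
      fields:
        topic: 'abc'
  - add_fields:
      when:
        equals:
          kubernetes.namespace: 'xyz'
      target: ''
      fields:
        topic: 'xyz'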

But the biggest concern is the malformed exclude_files: ["*xyz*"] regex. I think this is open to bad configs that can easily be missed, as was the case with us. If such a misconfiguration can potentially bring down the entire K8s cluster, then it is pretty scary and needs to be addressed.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.