We have Filebeat version 7.12.0 running as a DaemonSet on all our production Kubernetes clusters. Recently we faced a catastrophic situation where an entire cluster of more than 300 nodes was brought down by Filebeat.
We have configured Filebeat with around 35 different input.d configurations, most of them of type: container or type: log.
In one of those 35 input.d configs, we configured a wrong regex on exclude_files, which started throwing errors like:
```
Error creating runner from config: error parsing regexp: missing argument to repetition operator: `*` accessing 'exclude_files.0' (source:'/opt/filebeat/etc/inputs.d/abc.yml')
```
The configuration for that particular prospector was:
```yaml
- type: log
  paths:
    - /var/log/containers/*_abc_*.log
  symlinks: true
  exclude_files: ["*xyz*"]   # MALFORMED REGEX
  json.keys_under_root: false
  fields:
    topic: "abc"
  fields_under_root: true
  processors:
    - add_kubernetes_metadata:
        in_cluster: true
        default_matchers.enabled: false
        matchers:
          - logs_path:
              logs_path: /var/log/containers/
```
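For reference, the pattern `*xyz*` is glob syntax, not a regular expression: Filebeat's exclude_files expects Go (RE2) regular expressions, and a leading `*` has no preceding token to repeat, hence the "missing argument to repetition operator" error. A minimal sketch of the failure and the fix, using Python's re module purely for illustration (it rejects the same pattern for the same reason):

```python
import re

# "*xyz*" is glob syntax; as a regex the leading "*" has no
# preceding token to repeat, so compilation fails.
try:
    re.compile("*xyz*")
except re.error as err:
    print(f"invalid pattern: {err}")  # nothing to repeat

# The regex equivalent of the glob "*xyz*" is ".*xyz.*"
# (or simply "xyz", since exclude_files matches unanchored).
pattern = re.compile(".*xyz.*")
print(bool(pattern.search("/var/log/containers/pod_xyz_1.log")))  # True
```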
This bad regex configuration triggered such a large number of API calls to the kube-apiserver for Kubernetes metadata fetches that it brought down the master node itself, which eventually took down the whole Kubernetes cluster.
How can Filebeat be made to handle these kinds of misconfigurations more gracefully, so that a single bad input does not bring down an entire Kubernetes cluster?
What are the general recommendations on the usage of add_kubernetes_metadata? I think that if I consolidated all the different input.d configurations into a single one with just one add_kubernetes_metadata processor, the number of requests to the K8s API would be reduced significantly, which might help avoid bringing down the cluster. (Please let me know whether that would help.)
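To illustrate the consolidation idea, here is a hypothetical sketch of a single input covering several log paths with one shared add_kubernetes_metadata processor. The second path and the field values are placeholders, not our real config, and the broken pattern is replaced with a valid RE2 regex:

```yaml
- type: log
  paths:
    - /var/log/containers/*_abc_*.log
    - /var/log/containers/*_def_*.log   # hypothetical second path
  symlinks: true
  exclude_files: ['xyz']                # valid RE2; matching is unanchored
  json.keys_under_root: false
  fields_under_root: true
  processors:
    - add_kubernetes_metadata:          # single shared processor instance
        in_cluster: true
        default_matchers.enabled: false
        matchers:
          - logs_path:
              logs_path: /var/log/containers/
```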
Attaching some images to give a gist of the increase in the number of K8s API requests.