Hi,
We are running Filebeat 7.12.0 as a DaemonSet on all our production K8s clusters. Recently we faced a catastrophic situation where an entire K8s cluster of around 300 nodes was brought down by Filebeat.
We have configured Filebeat with around 35 different inputs.d configurations, most of them either type: container or type: log.
In one of those 35 inputs.d configs we set a malformed regex on exclude_files, which started throwing errors like:
Error creating runner from config: error parsing regexp: missing argument to repetition operator: `*` accessing
'exclude_files.0' (source:'/opt/filebeat/etc/inputs.d/abc.yml')
The configuration for that particular input was:
- type: log
  paths:
    - /var/log/containers/*_abc_*.log
  symlinks: true
  exclude_files: ["*xyz*"]   # MALFORMED REGEX
  json.keys_under_root: false
  fields:
    topic: "abc"
  fields_under_root: true
  processors:
    - add_kubernetes_metadata:
        in_cluster: true
        default_matchers.enabled: false
        matchers:
          - logs_path:
              logs_path: /var/log/containers/
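For context on the error itself: exclude_files takes Go (RE2) regular expressions, not shell globs, so *xyz* fails because the leading * has nothing to repeat. If I understand the regex syntax correctly, the line we intended would look something like this (xyz is just a placeholder for our real pattern):

  # Unanchored 'xyz' already matches anywhere in the file path;
  # '.*xyz.*' would be an equivalent but redundant form.
  exclude_files: ['xyz']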
This bad regex configuration triggered such a huge number of API calls to kube-apiserver for Kubernetes metadata that it brought down the master node itself, which in turn brought down the whole Kubernetes cluster.
How can we make Filebeat handle this kind of misconfiguration more gracefully, so that it doesn't bring down the entire K8s cluster?
What are the general recommendations on the usage of add_kubernetes_metadata? I think that if I consolidate all the different inputs.d configurations into a single one with just one add_kubernetes_metadata config, the number of requests to the K8s API should drop significantly and may help avoid bringing down the entire K8s cluster. (Please let me know if that would help.)
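To make the consolidation idea concrete, here is a minimal sketch of what I have in mind, assuming a single filebeat.yml with one global add_kubernetes_metadata processor at the top level instead of one per input (the paths and the topic field are placeholders taken from the example above):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/containers/*_abc_*.log
    symlinks: true
    exclude_files: ['xyz']
    json.keys_under_root: false
    fields:
      topic: "abc"
    fields_under_root: true
  # ... the other inputs go here, each without its own add_kubernetes_metadata ...

# A single top-level processor applies to events from all inputs, so
# (as far as I understand) only one metadata watcher/cache talks to the
# kube-apiserver instead of one per input.
processors:
  - add_kubernetes_metadata:
      in_cluster: true
      default_matchers.enabled: false
      matchers:
        - logs_path:
            logs_path: /var/log/containers/

My main question is whether this actually reduces the API load the way I expect, or whether the processor instances already share a single client/cache internally.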
Attaching some images to give an idea of the increase in the number of K8s API requests.