Hi everyone,
We've been having this problem for over a week now. We're running Filebeat on a Kubernetes cluster with Istio (not sure if that's relevant). The cluster has roughly 50 nodes and 700 pods.
Everything worked fine for months. Then, about a week ago, some Filebeat containers stopped collecting Kubernetes metadata, and the data started reaching our Elasticsearch cluster without it. We couldn't quite figure out what the issue was, but we found this thread and saw that we were getting both "Watching API error EOF" and "i/o timeout" errors. Our theory was that, since we had just added a lot more load to the cluster, these timeouts started happening and broke metadata collection.
Anyhow, we tried the recommended solution and enabled autodiscover. The labeling problem is now gone. However, sometimes when a new Filebeat pod starts, or (more likely) restarts due to node autoscaling, autodiscover stops working and the pod stops sending logs altogether! What is more interesting is that we configured Filebeat with a kubernetes autodiscover provider and a simple "log" prospector (to collect system logs from /var/log). When this happens, the log prospector does collect logs and send them to our cluster, but only from /var/log/kube-proxy.log, ignoring all the other logs under /var/log. Very weird.
Deleting the pod and letting the daemonset start a new one usually solves the problem.
Furthermore, I've checked several instances, and this behavior always follows this error at startup:
kubernetes/util.go:90 kubernetes: Querying for pod failed with error: %!(EXTRA string=performing request: Get https://100.64.0.1:443/api/v1/namespaces/kube-system/pods/filebeat-22bsc: dial tcp 100.64.0.1:443: i/o timeout)
So it seems that Filebeat tries to fetch some information about its own pod (metadata?), fails, and this somehow "silently" breaks the collector, except for /var/log/kube-proxy.log, that is. We probably need to tune our api-server to avoid such frequent timeouts, but there's definitely some undesired behavior on the Filebeat side as well.
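One thing I'm thinking of trying, though I haven't verified it yet so treat it as a sketch: since the error comes from Filebeat querying the API server for its own pod to work out which node it is running on, it might help to pass the node name in explicitly via the downward API and the autodiscover provider's host setting, so that lookup isn't needed at all. Roughly:

# In the DaemonSet pod spec: expose the node name to the container
# via the downward API (assumption: our manifest doesn't set this yet).
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName

# In filebeat.yml: tell the autodiscover provider which node it is on,
# instead of letting it discover the node by querying its own pod.
filebeat.autodiscover:
  providers:
    - type: kubernetes
      host: ${NODE_NAME}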
Here is my ConfigMap. I used the deployment manifest from the Filebeat GitHub repo as a base and edited a few things.
Data
====
filebeat.yml:
----
filebeat.config:
  inputs:
    # Mounted `filebeat-inputs` configmap:
    path: ${path.config}/inputs.d/*.yml
    # Reload inputs configs as they change:
    reload.enabled: false
  modules:
    path: ${path.config}/modules.d/*.yml
    # Reload module configs as they change:
    reload.enabled: false

filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            not:
              equals:
                kubernetes.container.name: filebeat
          config:
            - type: docker
              containers.ids:
                - "${data.kubernetes.container.id}"
              multiline.pattern: '^npm\sERR\!|^Caused\sby|^Traceback|^[A-Z][a-zA-Z]+\:\s|^[[:space:]]'
              multiline.negate: false
              multiline.match: after

processors:
  - add_cloud_metadata:

output.file.enabled: false

output.elasticsearch:
  hosts: ['${ELASTICSEARCH_HOST}:${ELASTICSEARCH_PORT}']
  index: "logs-kube-filebeat-%{[beat.version]}-%{+yyyy.MM.dd}"
  pipeline: mypipeline

setup.template.name: "filebeat-%{[beat.version]}"
setup.template.pattern: "filebeat-%{[beat.version]}-*"

http.enabled: false
Data
====
kubernetes.yml:
----
- type: log
  enabled: true
  paths:
    - /var/log/*.log
    - /var/log/messages
    - /var/log/syslog
Any ideas? I'll try enabling debug logging for kubernetes and autodiscover, and disabling the log input for system logs so that only autodiscover is left, to see if that helps.
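For reference, this is roughly what I plan to add to filebeat.yml for the debug logging, scoped to just those selectors so the output doesn't get too noisy:

# Debug logging limited to the kubernetes and autodiscover subsystems:
logging.level: debug
logging.selectors: ["kubernetes", "autodiscover"]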