Filebeat on Kubernetes/Autodiscover stops sending logs


(Fernando Barbosa) #1

Hi everyone,

So we've been having this problem for over a week now. We're running Filebeat on a Kubernetes cluster with Istio (not sure if that's relevant). The cluster has ~50 nodes and ~700 pods.

Everything worked fine for months. Then about a week ago some filebeat containers stopped collecting Kubernetes metadata, and the data would reach our Elasticsearch cluster without it. We couldn't quite figure out what the issue was, but we found this thread, and we were getting both Watching API error EOF and i/o timeout errors. So we thought that since we had just added a lot more load to our cluster, these "timeouts" started to happen, breaking metadata collection.

Anyhow, we tried the recommended solution and enabled autodiscover. Now the labeling problem is gone. However, sometimes when a new Filebeat pod starts, or more likely after it restarts (most likely due to node autoscaling), autodiscover stops working and the pod stops sending logs altogether! What is more interesting is that we configured Filebeat with a kubernetes autodiscover prospector and a simple "log" prospector (to collect system logs at /var/log). When this happens, the log prospector does collect logs and send them to our cluster, but only from /var/log/kube-proxy.log, ignoring all other logs at /var/log. Very weird.

Deleting the pod and letting the daemonset start a new one usually solves the problem.

Furthermore, I've checked several instances, and this behavior always follows this error at startup:

kubernetes/util.go:90 kubernetes: Querying for pod failed with error: %!(EXTRA string=performing request: Get https://100.64.0.1:443/api/v1/namespaces/kube-system/pods/filebeat-22bsc: dial tcp 100.64.0.1:443: i/o timeout)

So it seems that Filebeat tries to fetch some information about its own pod (metadata?), fails, and this somehow "silently" breaks the collector - except for /var/log/kube-proxy.log, that is. I think we may need to tune our api-server better to avoid such frequent timeout errors, but there's definitely some undesired behavior from the Filebeat containers.
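Side note: the error suggests Filebeat only queries its own pod in order to figure out which node it is running on. If I read the docs right, the kubernetes provider has a host option that lets you pass the node in explicitly, which should let it skip that self-lookup entirely. Something like this (untested on our side; NODE_NAME would have to be injected into the DaemonSet via the downward API):

```yaml
# Hypothetical mitigation sketch, not our current config.
# In the DaemonSet pod spec, expose the node name as an env var:
#
#   env:
#     - name: NODE_NAME
#       valueFrom:
#         fieldRef:
#           fieldPath: spec.nodeName
#
# Then tell the autodiscover provider which node it is on, so it
# doesn't need to query the API server for its own pod at startup:
filebeat.autodiscover:
  providers:
    - type: kubernetes
      host: ${NODE_NAME}
```

That wouldn't fix the api-server timeouts themselves, but it might stop a single failed self-lookup at startup from breaking discovery.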

Here is my configmap. I used the deployment from the Filebeat GitHub repo as a base and edited a few things.

Data
====
filebeat.yml:
----
filebeat.config:
  inputs:
    # Mounted `filebeat-inputs` configmap:
    path: ${path.config}/inputs.d/*.yml
    # Reload inputs configs as they change:
    reload.enabled: false
  modules:
    path: ${path.config}/modules.d/*.yml
    # Reload module configs as they change:
    reload.enabled: false

filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            not:
              equals:
                kubernetes.container.name: filebeat
          config:
            - type: docker
              containers.ids:
                - "${data.kubernetes.container.id}"
              multiline.pattern: '^npm\sERR\!|^Caused\sby|^Traceback|^[A-Z][a-zA-Z]+\:\s|^[[:space:]]'
              multiline.negate: false
              multiline.match: after

processors:
  - add_cloud_metadata:

output.file.enabled: false
output.elasticsearch:
  hosts: ['${ELASTICSEARCH_HOST}:${ELASTICSEARCH_PORT}']
  index: "logs-kube-filebeat-%{[beat.version]}-%{+yyyy.MM.dd}"
  pipeline: mypipeline

setup.template.name: "filebeat-%{[beat.version]}"
setup.template.pattern: "filebeat-%{[beat.version]}-*"

http.enabled: false
Data
====
kubernetes.yml:
----
- type: log
  enabled: true
  paths:
    - /var/log/*.log
    - /var/log/messages
    - /var/log/syslog

Any ideas? I'll try to add debug logs for kubernetes and autodiscover and disable the log collector for system logs, leaving only the autodiscover, and see if this helps.


(Pier-Hugues Pellerin) #2

Hello @fbcbarbosa What version of Filebeat are you running?

Also, debug logs would help in this case. From the code, I think it would be useful to use the "config" and "kubernetes" selectors; adding the following to your configuration would do that.

logging.level: debug
logging.selectors: ["config", "kubernetes" ]

(Fernando Barbosa) #3

Hey @pierhugues

I'm running docker.elastic.co/beats/filebeat:6.4.3.

I'll activate the debug logs and post when the error occurs again.


(Fernando Barbosa) #4

@pierhugues

So I got an error again (in at least one container, probably more). It's a bit different, though. This time the container started and was collecting logs just fine, but restarted in less than 3 minutes. After restarting, it stopped collecting logs. Here's the restart info:

    State:          Running
      Started:      Wed, 21 Nov 2018 08:57:16 -0200
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 21 Nov 2018 08:53:58 -0200
      Finished:     Wed, 21 Nov 2018 08:56:24 -0200
    Ready:          True

I've gotten similar errors before. I'm not sure why the container stopped; at first I thought it could be the cluster autoscaler, but that wouldn't cause the container to restart - it would create a new pod instead (plus it's a DaemonSet, so it should go down with the node). Here are the logs of the container before stopping; it just gets a "Stopping filebeat" out of the blue:

$ kubectl logs -p filebeat-wmzpx
...
2018-11-21T10:56:24.084Z	INFO	beater/filebeat.go:437	Stopping filebeat
2018-11-21T10:56:24.084Z	INFO	[autodiscover]	cfgfile/list.go:126	Stopping 14 runners ...
2018-11-21T10:56:24.084Z	DEBUG	[autodiscover]	cfgfile/list.go:137	Stopping runner: input [type=docker, ID=1694951394572659149]
2018-11-21T10:56:24.084Z	INFO	input/input.go:149	input ticker stopped
2018-11-21T10:56:24.084Z	INFO	input/input.go:167	Stopping Input: 1694951394572659149
2018-11-21T10:56:24.084Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error context canceled
2018-11-21T10:56:24.084Z	INFO	kubernetes/watcher.go:258	kubernetes: Ignoring event, moving to most recent resource version
2018-11-21T10:56:24.084Z	INFO	kubernetes/watcher.go:233	kubernetes: Watching API for resource events stopped
2018-11-21T10:56:24.084Z	DEBUG	[autodiscover]	cfgfile/list.go:137	Stopping runner: input [type=docker, ID=12037470292381425722]
[...]
2018-11-21T10:56:24.261Z	INFO	[monitoring]	log/log.go:150	Uptime: 2m26.16569121s
2018-11-21T10:56:24.261Z	INFO	[monitoring]	log/log.go:127	Stopping metrics logging.
2018-11-21T10:56:24.261Z	INFO	instance/beat.go:389	filebeat stopped.

I could send you the rest, but I didn't find anything interesting. Right after the restart I get this error:

$ kubectl logs filebeat-wmzpx | grep ERR
2018-11-21T10:57:46.654Z	ERROR	kubernetes/util.go:90	kubernetes: Querying for pod failed with error: %!(EXTRA string=performing request: Get https://100.64.0.1:443/api/v1/namespaces/kube-system/pods/filebeat-wmzpx: dial tcp 100.64.0.1:443: i/o timeout)
2018-11-21T11:32:21.314Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-11-21T11:32:21.422Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName
2018-11-21T12:03:54.259Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF
2018-11-21T12:54:51.547Z	ERROR	kubernetes/watcher.go:254	kubernetes: Watching API error EOF

If I search around it:

$ kubectl logs filebeat-wmzpx | grep -A 10 -B 10 "Querying for pod failed with error:"
2018-11-21T10:57:16.650Z	INFO	registrar/registrar.go:134	Loading registrar data from /usr/share/filebeat/data/registry
2018-11-21T10:57:16.650Z	INFO	[monitoring]	log/log.go:114	Starting metrics logging every 30s
2018-11-21T10:57:16.651Z	INFO	registrar/registrar.go:141	States Loaded from registrar: 14
2018-11-21T10:57:16.651Z	INFO	crawler/crawler.go:72	Loading Inputs: 0
2018-11-21T10:57:16.651Z	INFO	crawler/crawler.go:106	Loading and starting Inputs completed. Enabled inputs: 0
2018-11-21T10:57:16.651Z	WARN	[cfgwarn]	kubernetes/kubernetes.go:51	BETA: The kubernetes autodiscover is beta
2018-11-21T10:57:16.652Z	INFO	cfgfile/reload.go:141	Config reloader started
2018-11-21T10:57:16.652Z	INFO	cfgfile/reload.go:196	Loading of config files completed.
2018-11-21T10:57:16.652Z	INFO	kubernetes/util.go:86	kubernetes: Using pod name filebeat-wmzpx and namespace kube-system to discover kubernetes node
2018-11-21T10:57:46.652Z	INFO	[monitoring]	log/log.go:141	Non-zero metrics in the last 30s <redacted>
2018-11-21T10:57:46.654Z	ERROR	kubernetes/util.go:90	kubernetes: Querying for pod failed with error: %!(EXTRA string=performing request: Get https://100.64.0.1:443/api/v1/namespaces/kube-system/pods/filebeat-wmzpx: dial tcp 100.64.0.1:443: i/o timeout)
2018-11-21T10:57:46.654Z	DEBUG	[autodiscover]	autodiscover/autodiscover.go:81	Configured autodiscover provider: kubernetes
2018-11-21T10:57:46.654Z	INFO	autodiscover/autodiscover.go:102	Starting autodiscover manager
2018-11-21T10:57:46.654Z	INFO	kubernetes/watcher.go:180	kubernetes: Performing a resource sync for *v1.PodList
2018-11-21T10:57:46.803Z	INFO	kubernetes/watcher.go:194	kubernetes: Resource sync done
2018-11-21T10:57:46.803Z	INFO	kubernetes/watcher.go:238	kubernetes: Watching API for resource events
...

The [monitoring] logs seem fine (they go on and on with Non-zero metrics in the last 30s), but no logs arrive in the cluster. Searching for the [autodiscover] tag gave something more interesting, though:

$ kubectl logs filebeat-wmzpx | grep -e "\[autodiscover\]"
2018-11-21T10:57:46.654Z	DEBUG	[autodiscover]	autodiscover/autodiscover.go:81	Configured autodiscover provider: kubernetes

Only one line after the restart. But before:

$ kubectl logs -p filebeat-wmzpx | grep -e "\[autodiscover\]"
2018-11-21T10:53:58.584Z	DEBUG	[autodiscover]	autodiscover/autodiscover.go:81	Configured autodiscover provider: kubernetes
2018-11-21T10:53:58.594Z	DEBUG	[autodiscover]	autodiscover/autodiscover.go:161	Got a start event: <redacted>
2018-11-21T10:53:58.594Z	DEBUG	[autodiscover]	autodiscover/autodiscover.go:234	Got a meta field in the event
2018-11-21T10:53:58.594Z	DEBUG	[autodiscover]	cfgfile/list.go:70	Starting reload procedure, current runners: 0
2018-11-21T10:53:58.594Z	DEBUG	[autodiscover]	cfgfile/list.go:88	Start list: 1, Stop list: 0
2018-11-21T10:53:58.595Z	DEBUG	[autodiscover]	cfgfile/list.go:109	Starting runner: input [type=docker, ID=1694951394572659149]

Thousands and thousands of lines. It seems that after the restart and the i/o timeout error, it no longer listens to Kubernetes events.

Hope this helps with debugging. I've saved the full logs in case you want to take a look at them @pierhugues. Thanks for taking an interest in this, by the way :slight_smile:

My guess? Some networking issue (of which a slow api-server is either a symptom or the cause) is causing Filebeat to die silently; once it restarts, if the issue is not gone, it fails to collect some important metadata, which breaks autodiscover (without autodiscover enabled, the collector would still work, but metadata collection would break). Of course, I know next to nothing about Filebeat's codebase, so this is nothing more than a guess.