Kubernetes - Filebeat stops sending/picking up logs


(Matt Hemingway) #1

Hi! Running Filebeat 6.2.3 in Kubernetes with filebeat.autodiscover. Here's the filebeat.yml ConfigMap:

  filebeat.yml: |-
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          in_cluster: true
          templates:
            - condition:
                or:
                  - equals:
                      kubernetes.namespace: development
                  - equals:
                      kubernetes.namespace: sqa
                  - equals:
                      kubernetes.namespace: test
                  - equals:
                      kubernetes.namespace: stage
              config:
                - type: docker
                  containers.ids:
                    - "${data.kubernetes.container.id}"
                  multiline.pattern: '^[[:space:]]'
                  multiline.negate: false
                  multiline.match: after

    output.logstash:
      hosts: ['${LOGSTASH_HOST}:${LOGSTASH_PORT}']

At some point, Filebeat seems to either stop polling K8s or loses the log files or just stops sending to Logstash. If I restart the Filebeat pods, it starts sending logs again.

Looking at the pod logs doesn't seem to indicate any major errors or anything. The only thing I see as an ERROR is this:

2018-04-18T15:56:46.755Z ERROR kubernetes/watcher.go:154 kubernetes: Watching API error EOF

The time those errors started don't sync up with when data stopped coming through though.

First error was here:

2018-04-13T20:58:04.867Z ERROR kubernetes/util.go:57 Querying for pod failed with error: %!(EXTRA string=Get https://10.233.0.1:443/api/v1/namespaces/kube-system/pods/filebeat-wqgsv: dial tcp 10.233.0.1:443: i/o timeout)

But logs stopped being sent on the 11th.

Any ideas on what's going on? Thank you!

-Matt


(Noémi Ványi) #2

Could you please share the debug logs of Filebeat?


(Matt Hemingway) #3

Do I just add a -d as an arg in the container spec in the k8s YAML file to enable debug logging?

It seems to run for a couple days before crapping out. I'm a bit concerned about that much debug messages being logged but it's pre-prod so whatevs (amirite? :slight_smile:


(Carlos Pérez Aradros) #4

Hi,

You can enable autodiscover related selectors only: -d kubernetes,autodiscover. Do you still have the full log? I'm interested on the lines happening after:

2018-04-18T15:56:46.755Z ERROR kubernetes/watcher.go:154 kubernetes: Watching API error EOF

(J√°nos Csorv√°si) #5

Hi @exekias and @kvch,

We also hit this with one of our heartbeat loggers that eventually produced an alert. Seems like after filebeat hits this bug, it basically stops harvesting pod logs.
The heartbeat logger is basically a simple kubernetes Deployment running

while true; do echo "{\"message\": \"Kubernetes checking in\", \"logger_type\": \"stdout\"}"; sleep 60; done

I have turned on debug logging now as you advised above, but I have wait for the bug to be triggered again to have those logs. In the meantime I dug around in kibana for filebeat logs, and found this excerpt (copied from kibana UI, so the order is reversed)

May 11th 2018, 11:01:41.358 INFO Prospector ticker stopped
May 11th 2018, 11:01:41.358 INFO Harvester started for file: /var/lib/docker/containers/c098839864276336ab59ffc1b44c1d1e8f4145d86a90b11e17d49a4331342641/c098839864276336ab59ffc1b44c1d1e8f4145d86a90b11e17d49a4331342641-json.log
May 11th 2018, 11:01:41.357 INFO Configured paths: [/var/lib/docker/containers/c098839864276336ab59ffc1b44c1d1e8f4145d86a90b11e17d49a4331342641/*.log]
May 11th 2018, 11:01:41.357 INFO Stopping Prospector: 8982987255257196772
May 11th 2018, 11:01:41.357 INFO Autodiscover starting runner: prospector [type=docker, ID=8755765163011943473]
May 11th 2018, 11:01:41.357 INFO Prospector ticker stopped
May 11th 2018, 11:01:41.357 WARN EXPERIMENTAL: Docker prospector is enabled.
May 11th 2018, 11:01:41.357 INFO Autodiscover stopping runner: prospector [type=docker, ID=8982987255257196772]
May 11th 2018, 11:01:41.356 INFO Configured paths: [/var/lib/docker/containers/109f3ef4093b8e32744f7da5d034e32aa41ab3375b67c83e58ba0aa87b6bef51/*.log]
May 11th 2018, 11:01:41.356 WARN EXPERIMENTAL: Docker prospector is enabled.
May 11th 2018, 11:01:41.356 INFO Autodiscover starting runner: prospector [type=docker, ID=16898863723061103725]
May 11th 2018, 11:01:41.350 INFO Stopping Prospector: 16898863723061103725
May 11th 2018, 11:01:41.350 WARN EXPERIMENTAL: Docker prospector is enabled.
May 11th 2018, 11:01:41.350 INFO Autodiscover stopping runner: prospector [type=docker, ID=16898863723061103725]
May 11th 2018, 11:01:41.350 INFO Prospector ticker stopped
May 11th 2018, 11:01:41.346 ERROR kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName
May 11th 2018, 11:01:41.346 INFO kubernetes: Ignoring event, moving to most recent resource version
May 11th 2018, 11:01:41.346 INFO kubernetes: Watching API for pod events
May 11th 2018, 11:01:41.344 ERROR kubernetes: Watching API error EOF
May 11th 2018, 11:01:41.344 INFO kubernetes: Watching API for pod events
May 11th 2018, 11:00:09.649 INFO File is inactive: /var/lib/host/logs/auth.log. Closing because close_inactive of 5m0s reached.
May 11th 2018, 10:55:04.630 INFO Harvester started for file: /var/lib/host/logs/auth.log
May 11th 2018, 10:50:09.624 INFO File is inactive: /var/lib/host/logs/auth.log. Closing because close_inactive of 5m0s reached.
May 11th 2018, 10:45:04.610 INFO Harvester started for file: /var/lib/host/logs/auth.log
May 11th 2018, 10:40:09.607 INFO File is inactive: /var/lib/host/logs/auth.log. Closing because close_inactive of 5m0s reached.
May 11th 2018, 10:35:04.586 INFO Harvester started for file: /var/lib/host/logs/auth.log
May 11th 2018, 10:30:09.567 INFO File is inactive: /var/lib/host/logs/auth.log. Closing because close_inactive of 5m0s reached.
May 11th 2018, 10:25:04.557 INFO Harvester started for file: /var/lib/host/logs/auth.log
May 11th 2018, 10:22:14.546 INFO File is inactive: /var/lib/host/logs/auth.log. Closing because close_inactive of 5m0s reached.
May 11th 2018, 10:17:29.095 INFO File is inactive: /var/lib/docker/containers/286317a149ad1220a2dd99ece95807181c32736fac80151c27b49a6a4458e844/286317a149ad1220a2dd99ece95807181c32736fac80151c27b49a6a4458e844-json.log. Closing because close_inactive of 5m0s reached.
May 11th 2018, 10:15:04.536 INFO Harvester started for file: /var/lib/host/logs/auth.log

These lines stand out to me the most

May 11th 2018, 11:01:41.346 ERROR kubernetes: Watching API error proto: wrong wireType = 6 for field ServiceAccountName
May 11th 2018, 11:01:41.346 INFO kubernetes: Ignoring event, moving to most recent resource version
May 11th 2018, 11:01:41.346 INFO kubernetes: Watching API for pod events
May 11th 2018, 11:01:41.344 ERROR kubernetes: Watching API error EOF

Previously we encountered the problem with the infinite loop which was fixed in the PR https://github.com/elastic/beats/pull/6504, the same one that introduced the above log line Ignoring event, moving to most recent resource version. I hope this helps a bit :slight_smile:


(Carlos Pérez Aradros) #6

Thank you for sending that @Janos_Csorvasi,

I'm wondering, do you see any more messages like these after:

May 11th 2018, 11:01:41.344 INFO kubernetes: Watching API for pod events

?

Pod watcher should keep retrying while errors happen. So a partial stop may be normal as long as it recovers from it.


(J√°nos Csorv√°si) #7

Here is a gist containing the logs from the last 24h for that one specific filebeat instance, again, extracted from kibana: https://gist.github.com/jcsorvasi/8e5e6d0ce2e15c24f6c34d944a95f818

It contains numerous instances of Watching API for pod events and Ignoring event, moving to most recent resource version.

(Still waiting on the debug logs, it has not triggered yet after the redeployments)


(Carlos Pérez Aradros) #8

Thank you!

I see some activity after the ERROR + retry, do you stop receiving logs? It would be awesome to know when exactly that happens to focus on Filebeat logs at that time.


(J√°nos Csorv√°si) #9

Alright, here are some logs from around the time I got paged due to missing logs: https://gist.github.com/jcsorvasi/d442dc1f2c9076804c14229466cce140
The gist contains DEBUG level messages now, for brevity I have filtered out the Non-zero metrics in the last 30s lines, let me know if you need them.

A new piece of information is that it seems like all logs eventually reach their destination (show up in kibana), but very late. It seems to me that filebeat takes half an hour to recover from the error and start harvesting and the logs again. In the above excerpt, d76136a412b48c76b245633d118ec3690290c3a51107b709fc8455f8d595ae7f is the id of the heartbeat logger in question, whose messages we closely monitor.

For this instance of filebeat, the kubernetes: Watching API error EOF error happened 28 times over the course of 24 hours.

Edit: here is how the problem looks at scale. The cluster is running 30 nodes, with filebeat running as a daemonset and a heartbeat producer running as a daemonset as well (logs 1 line every 1 minute), so I would expect the graph to be perfectly smooth at count 30 up until the last minute.

Let me know if you'd like any more information or if there is any way I can help.


(Carlos Pérez Aradros) #10

Thank you for such a detailed report!

I'll have a look again, I think I got enough info by now, will ping back on results

Best regards


(J√°nos Csorv√°si) #11

Thank you, much appreciated!


(J√°nos Csorv√°si) #12

Hi @exekias,

Do you have any updates maybe?

Thanks,

J√°nos


(Carlos Pérez Aradros) #13

I'm suspecting this is fixed in 6.3, would it be possible to give that a try? There is a snapshot from 6.3 branch (yet unreleased) under exekias/filebeat:6.3.0.

Of course, I would recommend using this under a non production scenario, as this is a snapshot of an unreleased version.

Best regards


(J√°nos Csorv√°si) #14

Thanks @exekias, I deployed your version, so far so good, but it needs to run for a couple of hours before the bug usually manifests itself.

On a sidenote, initially I thought it was completely broken, because I did not find the test logs in the designated index, but then I found that 6.3.0 changes behaviour on how the annotations are extracted, which broke our parsing pipeline.

We set the following annotations in the pod spec:

annotations:
  logging.meltwater.io/system: "foundation"
  logging.meltwater.io/service: "producer"
  logging.meltwater.io/type: "nested_json"

This used to be extracted like this by filebeat (6.2.3)

"kubernetes": {
  "annotations": {
      "logging.meltwater.io/system": "foundation",
      "logging.meltwater.io/service": "producer",
      "logging.meltwater.io/type": "nested_json"
    }
  } 

Now it shows up like this

"kubernetes": {
  "annotations": {
    "logging": {
      "meltwater": {
        "io/type": "nested_json",
        "io/service": "producer",
        "io/system": "foundation"
      }
    }
  }
}

Not sure if this is intentional, but I definitely prefer the original version, it is much cleaner and namespaced annotations are quite widely used in kubernetes.

Thanks again and I'll get back to you when I have some new info!


(J√°nos Csorv√°si) #15

I've ran exekias/filebeat:6.3.0 for 4 hours with 30 instances (both filebeat and the test logger), the issue has not manifested itself in that time, so I'm hopeful for this release. :slight_smile: Any idea when it might come out?

I reverted for the weekend, we'll try to run a longer test next week.

Thanks for all the help!


(Carlos Pérez Aradros) #16

We don't publish release dates, but you can expect it in the following weeks. As for the behavior change in how we store metrics, that's expected, see: https://github.com/elastic/beats/pull/6490


(J√°nos Csorv√°si) #17

I see, thank you so much for your help!


(system) #18

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.