Filebeat 6.8.1 Sporadically Stops Sending Logs

We have four separate Kubernetes clusters, two on-prem and two in AWS, set up with Filebeat to ship logs. For several months we have been troubleshooting an issue where, sporadically, our Filebeat DaemonSets stop shipping logs altogether. We were on Filebeat 6.2.x when this started, and upgraded to 6.8.1 hoping that would help.

Filebeat will often ship logs for an extended period, then suddenly stop on several nodes. If we restart the Filebeat DaemonSet, it immediately starts shipping logs again, including the logs that were missed during the outage. At this point I am not sure what else to look at.

Memory/CPU consumption appears normal during these outages.
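To confirm Filebeat is genuinely behind (and not just tailing idle files), we compare registry offsets against current file sizes on a node. The 6.x registry (on the host at /var/run/filebeat in our manifest) is a JSON array of per-file states with `source` and `offset`. This is a rough triage sketch of our own, not anything shipped with Filebeat; in practice the `file_sizes` dict comes from `os.path.getsize` on each node:

```python
import json

def stalled_sources(registry_path, file_sizes):
    """Return files whose registry offset trails the current file size.

    registry_path: path to the Filebeat 6.x registry file
    file_sizes:    dict mapping log path -> current size in bytes
    """
    with open(registry_path) as fh:
        entries = json.load(fh)  # 6.x registry: a JSON array of states
    lag = {}
    for entry in entries:
        source, offset = entry.get("source"), entry.get("offset", 0)
        size = file_sizes.get(source)
        if size is not None and offset < size:
            lag[source] = size - offset  # bytes not yet shipped
    return lag
```

A large and growing lag on an otherwise idle-looking Filebeat is how we distinguish "nothing to ship" from "stopped shipping".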

System Version and component information:

  • Kubernetes 1.12.4
  • Filebeat 6.8.1
  • CNI: Kube-router
  • DNS provider: CoreDNS
  • RBAC: Enabled

Some errors we have seen:

ERROR   kubernetes/watcher.go:258       kubernetes: Watching API error read tcp x.x.x.x:59506->x.x.x.x:443: read: connection timed out


ERROR   kubernetes/watcher.go:248       kubernetes: Watching API error Get https://x.x.x.x:443/api/v1/pods?fieldSelector=spec.nodeName%3Dlg-l-p-obo00500&resourceVersion=&watch=true: dial tcp x.x.x.x:443: i/o timeout

ERROR   kubernetes/watcher.go:258       kubernetes: Watching API error EOF

ERROR   kubernetes/watcher.go:258       kubernetes: Watching API error read tcp x.x.x.x.184:39540->x.x.x.x:443: read: connection reset by peer  

ERROR   log/harvester.go:282    Read line error: invalid CRI log format; File: /var/lib/docker/containers/10e45645029f95adfdb1cee0c6341757e86d3c3115472d8076dc410fcb17eb30/10e45645029f95adfdb1cee0c6341757e86d3c3115472d8076dc410fcb17eb30-json.log

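For the "invalid CRI log format" errors, we suspected a mismatch between the `docker` input (which expects Docker's json-file format) and what is actually on disk. A quick classifier we used to eyeball sample lines from /var/lib/docker/containers (our own helper, not part of Filebeat):

```python
import json

def classify_log_line(line):
    """Guess whether a container log line is Docker json-file or CRI format.

    Docker json-file: {"log": "...", "stream": "stdout", "time": "..."}
    CRI:              2019-07-02T10:00:00.000000000Z stdout F some message
    """
    try:
        record = json.loads(line)
        if isinstance(record, dict) and "log" in record and "time" in record:
            return "docker-json"
    except ValueError:
        pass
    parts = line.split(" ", 3)
    if len(parts) >= 3 and parts[1] in ("stdout", "stderr"):
        return "cri"
    return "unknown"
```

In our case the files are `-json.log` files, so we would expect "docker-json"; partially written or rotated lines can still trip the harvester.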
However, the only error that reliably shows up on ALL affected nodes when the outage occurs is this:

ERROR   kubernetes/watcher.go:258       kubernetes: Watching API error EOF

Filebeat Configuration:
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: monitor
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - namespaces
  - pods
  verbs:
  - get
  - watch
  - list
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: monitor
  labels:
    k8s-app: filebeat
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: monitor
  labels:
    app: filebeat
spec:
  template:
    metadata:
      labels:
        app: filebeat
        name: filebeat
      annotations:
        lgi.io/team: "Platform"
    spec:
      serviceAccountName: filebeat
      tolerations:
      - operator: Exists
      containers:
      - name: filebeat
        image: registry.x.x/beats/filebeat:6.8.1
        imagePullPolicy: IfNotPresent
        args: [
          "-c", "/etc/filebeat/filebeat.yml",
          "-e",
        ]
        securityContext:
          runAsUser: 0
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat
          readOnly: true
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
      terminationGracePeriodSeconds: 30
      volumes:
      - name: config
        configMap:
          defaultMode: 0600
          name: filebeat
      - name: data
        # avoid data resubmission and store the registry on a hostPath
        hostPath:
          path: /var/run/filebeat
      - name: varlibdockercontainers
        hostPath:
          path: /app/containers
      - name: varlog
        hostPath:
          path: /var/log
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat
  namespace: monitor
data:
  filebeat.yml: |+
        filebeat.inputs:
        - type: log
          paths:
          - /var/log/kube-*.log
        - type: log
          paths:
          - /var/log/syslog
          - /var/log/messages
          include_lines: ['kubelet(\[[0-9]\])?:', 'kernel:', 'etcd:']
        - type: docker
          scan_frequency: 5s
          close_inactive: 5m
          containers.ids:
          - "*"
          processors:
          - add_kubernetes_metadata:
              in_cluster: true
              labels.dedot: true
              annotations.dedot: true
          - drop_event:
              # drop everything in kube-system, tools
              # except nginx-ingress-controller, prometheus2, kube-scheduler, kube-controller-manager, flannel
              when.and:
              - not.or:
                - equals:
                    . . .
                - and:
                  - equals:
                      kubernetes.labels.app: x-service
                  - equals:
                      kubernetes.namespace: tools
              - or:
                - not.regexp:
                    kubernetes.namespace: ".*"
                - equals:
                    kubernetes.namespace: kube-system
                - equals:
                    kubernetes.namespace: monitor
                - equals:
                    kubernetes.namespace: tools
          - decode_json_fields:
              when.regexp.message: '^{'
              fields: ["message"]
              target: ""
              overwrite_keys: true
          - drop_event:
              when.or:
              - equals: {level: DEBUG}
          - drop_fields:
              when.regexp.message: '^{'
              fields: ["message", "timestamp"]
          - dissect:
              tokenizer: "%{namespace}"
              field: "kubernetes.namespace"
              target_prefix: ""
          - rename:
              fields:
                . . .
          multiline:
            # Java exceptions and start scripts
            pattern: '^[[:space:]]+(at|\.{3})\b|^Caused by:|^\+'
            negate: false
            match: after
          fields_under_root: true

        output.kafka:
          # initial brokers for reading cluster metadata
          hosts: ["x.x.x.x:9092"]

          # message topic selection + partitioning
          topic: 'topic_name' # Changed in . . . .
          partition.round_robin:
            reachable_only: true

          required_acks: 0
          compression: gzip
          max_message_bytes: 8388608
          version: '0.10'
          worker: 5
          keep_alive: 30s
          channel_buffer_size: 2048

        logging.level: info
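One mitigation we are experimenting with (not in the manifests above, and purely our own sketch): a crude staleness check that could back a liveness probe, flagging Filebeat as stalled when its registry file stops being rewritten even though watched log files are still growing. Filebeat rewrites the registry as harvester state changes, so a stale registry mtime while logs grow is a reasonable stall signal:

```python
import os
import time

def registry_is_stale(registry_path, log_paths, max_age_seconds=300, now=None):
    """Return True if the registry has not been rewritten recently
    while at least one watched log file has been.

    A True result is what we would wire up as a liveness probe failure.
    """
    now = time.time() if now is None else now
    registry_age = now - os.path.getmtime(registry_path)
    if registry_age <= max_age_seconds:
        return False  # registry still being updated: healthy
    # registry quiet; only a problem if logs are actively growing
    return any(
        now - os.path.getmtime(p) < max_age_seconds
        for p in log_paths
        if os.path.exists(p)
    )
```

The 300-second threshold is a guess on our part; it would need tuning against quiet-but-healthy nodes to avoid restart loops.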