Filebeat iowaits issue?

raulgs · February 18, 2021, 8:37am

Hi all,

I am running a k8s Cluster that is being suspended every night, due to cost savings, as no one works at this time on the cluster. Unfortunately it seems like fielbeat is not really getting along with that.

Every morning when the cluster resumes, filebeats runs into a super high iowait

Total DISK READ :      45.83 M/s | Total DISK WRITE :      57.28 K/s
Actual DISK READ:      45.83 M/s | Actual DISK WRITE:     481.16 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
17018 be/4 root       14.91 M/s    0.00 B/s  0.00 % 99.99 % filebeat -c /etc/filebeat.yml -e
17014 be/4 root       11.21 M/s    0.00 B/s  0.00 % 99.99 % filebeat -c /etc/filebeat.yml -e
17013 be/4 root        9.45 M/s    0.00 B/s  0.00 % 99.99 % filebeat -c /etc/filebeat.yml -e
17073 be/4 root        8.22 M/s    0.00 B/s  0.00 % 98.93 % filebeat -c /etc/filebeat.yml -e

Which kind of basically kills the nodes.
The log file looks a followed - a kind of stucks at this point:

}
2021-02-18T10:40:34.919Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/etcd-0_sensu-system_backup-73237576d5eedf6350b50d5e391870e19a787aa81b588ff7086b463b458be4f3.log; Backoff now.
2021-02-18T10:40:35.001Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/sensu-backend-0_sensu-system_sensu-backend-f591eb9332122a48029ee9bdfda858912c7a999ce5a46845066bd89c947ff82a.log; Backoff now.
2021-02-18T10:40:35.088Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/gitlab-runner-rzneo-945fb5db9-j79qw_rzneo_gitlab-runner-5eda7e332f09638aeff884789e8beb28745aa88d77b3219a495423011831955f.log; Backoff now.
2021-02-18T10:40:34.982Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/csi-gce-pd-node-29jhk_gce-pd-csi-driver_csi-driver-registrar-36a2be3a7c4533fe82c6c32d9b8d009035b1d672178d0069de00c7ea63a246b3.log; Backoff now.
2021-02-18T10:40:34.827Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/metricbeat-924kz_elastic-system_metricbeat-7f93ebbf4f0263063c911fc3e1f6fc7c353c4fcbfa6129edfcb7d1eb76935701.log; Backoff now.
2021-02-18T10:40:34.897Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/podsigner-84797b6596-mtznn_rzneo_podsigner-debafb6b31028c0994f0e7e9ee6ab0e23466c612bef8cf7f123ed8cd79ed2507.log; Backoff now.
2021-02-18T10:40:34.958Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/etcd-0_sensu-system_etcd-151b63a866c39e74b9ae55603759814f76f145237244b1c38b764e756412cac2.log; Backoff now.
2021-02-18T10:40:35.096Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/vault-54b7c466cd-xxjnq_rzneo_vault-42acca5910c3685cbb4390d50f93a9faec281e3fc56324315bcac4fc703c8a15.log; Backoff now.
2021-02-18T10:40:34.923Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/consul-2_rzneo_backup-fddf2a80f458ff9c882d97cf961fdb1f5c0cd76eb97b03940d0422d800c38e42.log; Backoff now.
2021-02-18T10:40:35.034Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/prometheus-server-1_sensu-system_prometheus-server-configmap-reload-8b6964e1c8b84e1bdd17f7af17d89f4e4afe9d1ea1bf7791672a87961612757c.log; Backoff now.
2021-02-18T10:40:34.972Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/prometheus-server-1_sensu-system_prometheus-server-12c5b393aea38c5e2b6834f09f15f292d6084d16e8fb1f9c7721ddb812cc77cc.log; Backoff now.
2021-02-18T10:40:34.827Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/rzneo-postgres-db-2_rzneo_postgres-54fd47be8a1516eedc8f25b1fe1587bf8502ae88aec572c01e830a584fa790f4.log; Backoff now.
2021-02-18T10:40:34.958Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/prometheus-server-1_sensu-system_rules-receiver-91da7da17dabe477dd7e400213dac67bfd75ad472d8aa6e14f2bd5296199ef17.log; Backoff now.
2021-02-18T10:40:35.119Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/prometheus-kube-state-metrics-c65b87574-dzt4q_sensu-system_kube-state-metrics-6f56c10ef6821f22415baf44e3c0e445879f12d06fbb37e04c1927b9ea66f726.log; Backoff now.
2021-02-18T10:40:34.883Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/kube-metrics-adapter-6768b9ddb8-5bs5t_sensu-system_kube-metrics-adapter-030cffee12aa3ab77785d6d0ad4358065f2e9d2901f49057dc870cfc900b507e.log; Backoff now.
2021-02-18T10:40:35.177Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/elastic-es-masterdata-100-1_elastic-system_elasticsearch-94037a69325502cab16d78ad610c7c957cda2c105da2ae0ba1193b5520ac1ac9.log; Backoff now.
2021-02-18T10:40:34.917Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/kube-proxy-tm247_kube-system_kube-proxy-6cbca54fba7f7cd3c11d9887765e3ede46464460825aa7ae3018fadf40d92fa8.log; Backoff now.
2021-02-18T10:40:35.202Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/vault-54b7c466cd-xxjnq_rzneo_auto-unseal-b1e3578c56336c2f7efed10a5b87b6f60153225199710761043e57cac32d7856.log; Backoff now.
2021-02-18T10:40:35.202Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/sensu-backend-0_sensu-system_backup-fc09946c834effbcfdda93cb85fb9928662ebd56e07e8f869a0648498e345bc6.log; Backoff now.
2021-02-18T10:40:35.202Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/vault-token-service-7454ccd668-dkwjs_rzneo_vault-token-service-6ebae1fae717b0553c7ced08a96799fe6ee06ac229e6ca3e43e5f3dba033f835.log; Backoff now.
2021-02-18T10:40:35.210Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/cerebro-66bdf4ddbf-grs7v_elastic-system_cerebro-6cd3f0ca9ffcb604f1546e12ffb4b17d5e3fb56daa03049545f7729ca3dc0c1e.log; Backoff now.
2021-02-18T10:40:35.308Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/rzneo-postgres-db-2_rzneo_exporter-6145efb8632c65b6ee215097e0f2bb5e1c7807b788b324b533e1897cab3f3f0e.log; Backoff now.
2021-02-18T10:40:35.326Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/elastic-es-masterdata-100-1_elastic-system_elasticsearch-94037a69325502cab16d78ad610c7c957cda2c105da2ae0ba1193b5520ac1ac9.log; Backoff now.
2021-02-18T10:40:35.406Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/elastic-es-masterdata-100-1_elastic-system_elasticsearch-94037a69325502cab16d78ad610c7c957cda2c105da2ae0ba1193b5520ac1ac9.log; Backoff now.
2021-02-18T10:40:35.424Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/elastic-es-masterdata-100-1_elastic-system_elasticsearch-94037a69325502cab16d78ad610c7c957cda2c105da2ae0ba1193b5520ac1ac9.log; Backoff now.
2021-02-18T10:40:35.424Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/prometheus-node-exporter-b4ssb_sensu-system_prometheus-node-exporter-f3c3db858cfdc98ea3a2c01fb983fdd92dcf2b46ab7cd9f725af14b17e3eb9df.log; Backoff now.
2021-02-18T10:40:35.424Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/consul-2_rzneo_consul-692a39410c818a782395a047ec6719a363481c971f66087316ec6cc22599c7bb.log; Backoff now.
2021-02-18T10:40:35.504Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/rzneo-postgres-operator-5994567dcd-fhvbc_rzneo_postgres-operator-1dc7de20a664aa85f3c7dc70e6e32e51d21ec8f1b9b47d1bb37bedfc0d19eeba.log; Backoff now.
2021-02-18T10:40:35.564Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/grafana-86d566c885-z4mrm_sensu-system_grafana-bd4d6f8be158db308ab7836fa91804e446fb4cc2cc18b85e42f6f7c73b8b85e5.log; Backoff now.
2021-02-18T10:40:35.572Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/elastic-es-masterdata-100-1_elastic-system_elasticsearch-94037a69325502cab16d78ad610c7c957cda2c105da2ae0ba1193b5520ac1ac9.log; Backoff now.
2021-02-18T10:40:35.585Z        DEBUG   [harvester]     log/log.go:107  End of file reached: /var/log/containers/csi-gce-pd-node-29jhk_gce-pd-csi-driver_gce-pd-driver-18a1e8ca350a5e72790e79a2c50e962bc04ee68ffe061e8b8e6fbcfe5dc1cf41.log; Backoff now.
2021-02-18T10:40:37.601Z        DEBUG   [reader_multiline]      multiline/pattern.go:170        Multiline event flushed because timeout reached.
2021-02-18T10:41:15.970Z        DEBUG   [input] input/input.go:139      Run input

Uninstalling and reinstalling filebeat fixes the issue.
However, this is not a long term solution.

filebeat test output

sh-4.2# filebeat test output -c /etc/filebeat.yml
elasticsearch: https://elastic-es-http.elastic-system.svc:9200...
  parse url... OK
  connection...
    parse host... OK
    dns lookup... OK
    addresses: 10.101.39.28
    dial up... OK
  TLS...
    security... WARN server's certificate chain verification is disabled
    handshake... OK
    TLS version: TLSv1.3
    dial up... OK
  talk to server... OK
  version: 7.11.0

I would be very happy is someone could help us with that.

jsoriano · February 19, 2021, 12:28pm

Hey @raulgs,

Thanks for reporting this, I think it is worth investigating.

How do you "suspend" the cluster?

With this you mean that you delete and re-create the pods?

raulgs · February 19, 2021, 3:44pm

Hi @jsoriano
thanks for offering your help.
We normally just go into the google console and suspend the instance.
What we have found out doing a deeper look is the following:

When the cluster resumes the filebeat pods need somewhat around 300Mi of RAM. This is causing a OOM kill as the limit for the pods is set to 200Mi. As soon as the OOM kill happens the pods start to run crazy causing a very high iowait which makes the node almost unusable. Not sure what exactly is causing this. Maybe some memory leak or so. The only way to fix this is to run for example a kubectl rollout restart daemonset filebeat.

Does this make any sense to you?

jsoriano · February 19, 2021, 4:04pm

Interesting, are other pods restarted after resuming the instances? Filebeat could need more memory during some time if there is extra activity in the cluster.

What looks strange to me is that it misbehaves after being restarted because of the OOM, but it doesn't after a controlled restart.

How much memory is Filebeat using on these nodes during normal operations? If it is close to the limit, and resuming the instances generares additional activity, Filebeat could need a bit more memory to recover. Try to increase the memory limit to something like 400Mi and check if the behaviour improves.

raulgs · February 19, 2021, 6:59pm

After resuming the instances basically all pods restart as the docker and kubelet service is being restarted - at least it looks like it as the time that the services are running looks like it. However, we are not seeing issues with any other pods. We also have Metricbeat deployed on the same instances - but no issues so far. Increasing the memory limit for Filebeat fixed the issue. Still we are a bit concerned now that this could happen again out of nowhere and kill our cluster

Thats correct - it seems like after being killed by the OOM it does not get stopped correctly, maybe some caches are not being cleared or pointer not being reset - something that could be different when doing a clean restart.

During normal operations I see the following stats:

filebeat-4fr5k                67m          148Mi
filebeat-5r5lf                11m          161Mi
filebeat-6ctdk                6m           110Mi
filebeat-7shdg                56m          136Mi
filebeat-8hlw7                6m           121Mi
filebeat-97l2n                7m           137Mi

400Mi is what I have defined as new limit yesterday. This way it seems to work properly so far.

jsoriano · February 22, 2021, 8:03am

This could trigger a lot of changes in the state of filebeat, that could lead to a temporary increase of cpu and memory usage, but I wouldn't expect this to cause further problems. Maybe the issue is related to be memory constrained, especially if the situation improves increasing the memory limit.

Could you share the manifest you are using to deploy filebeat, including filebeat configuration?

How many pods are deployed on the nodes with these problems?

raulgs · February 22, 2021, 8:22am

That could indeed. However, I have no influence on how the services restart while resuming.
We have a 6 node cluster (3 master, 3 worker). Filebeat is deployed on all of them.

Below the filebeat manifest.

# Source: filebeat/templates/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: elastic-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      username: ${ELASTICSEARCH_USERNAME}
      password: ${ELASTICSEARCH_PASSWORD}
      protocol: "https"
    output.elasticsearch.ssl.verification_mode: "none"
    output.elasticsearch.index: "%{[kubernetes.namespace]}-filebeat-%{+xxxx.ww}"

    setup.template:
      enabled: true
      name: "filebeat-%{[agent.version]}"
      pattern: "*-filebeat-*"
      order: 150
      overwrite: true

    #logging.level: debug
    setup.ilm.enabled: false
    
    http:
      enabled: true
      host: localhost
      port: 5066

    filebeat.autodiscover:
      providers:
      - type: kubernetes
        node: ${NODE_NAME}
        cleanup_timeout: 300s # inactivity before stopping the running configuration for a container
        hints.enabled: true
        hints.default_config:
          type: container
          paths: 
            - /var/log/containers/*-${data.kubernetes.container.id}.log
          multiline.pattern: '^[[:space:]]'
          multiline.negate: false
          multiline.match: after
          exclude_lines: ["^\\s+[\\-`('.|_]"]  # drop asciiart lines

    processors:
      - add_host_metadata:
          netinfo.enabled: false
      - add_cloud_metadata:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
          - logs_path:
              logs_path: "/var/log/containers/"
      - decode_json_fields:
          fields: ["message"]
          process_array: false
          max_depth: 1
          target: ""
          overwrite_keys: true
          add_error_key: true
      - drop_fields:
          fields:
            [
              "agent.ephemeral_id",
              "agent.hostname",
              "agent.id",
              "agent.type",
              "agent.name",
              "agent.version",
              "ecs.version",
              "host.id",
              "host.name",
              "cloud.account.id",
              "cloud.instance.id",
              "cloud.machine.type",
              "cloud.project.id",
              "input.type",
              "log.offset",
            ]
          ignore_missing: true
      - drop_event: #namespaces to be excluded from logging
          when.or:
          - equals.kubernetes.namespace: "logging"
          - equals.kubernetes.namespace: "monitoring"
          - equals.kubernetes.namespace: "kubernetes-dashboard"
          - equals.kubernetes.namespace: "test"
          - equals.kubernetes.namespace: "default"

# Source: filebeat/templates/filebeat.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: elastic-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: filebeat
      annotations:
        checksum/config: 67ab3aa2956c2bd2fc7dcd02003ec638d000c85665c18820af71b3575097520d #automatically roll on configmap change
        vault.secrets: filebeat-secret
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:7.11.1
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        ports:
        - containerPort: 5066
        env:
        - name: ELASTICSEARCH_HOST
          value: "elastic-es-http.elastic-system.svc"
        - name: ELASTICSEARCH_PORT
          value: "9200"
        - name: ELASTICSEARCH_USERNAME
          valueFrom:
            secretKeyRef:
              name: filebeat-secret
              key: username
        - name: ELASTICSEARCH_PASSWORD
          valueFrom:
            secretKeyRef:
              name: filebeat-secret
              key: password
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        securityContext:
          runAsUser: 0
          privileged: false
#        livenessProbe:
#          exec:
#            command:
#              - sh
#              - -c
#              - |
#                #!/usr/bin/env bash -e
#                /usr/bin/timeout 5 curl --fail 127.0.0.1:5066
#          failureThreshold: 3
#          initialDelaySeconds: 10
#          periodSeconds: 10
#          timeoutSeconds: 5
#        readinessProbe:
#          exec:
#            command:
#              - sh
#              - -c
#              - |
#                #!/usr/bin/env bash -e
#                /usr/bin/timeout 5 filebeat test output -c /etc/filebeat.yml
#          failureThreshold: 3
#          initialDelaySeconds: 10
#          periodSeconds: 10
#          timeoutSeconds: 5
        resources:
          requests:
            cpu: "100m"
            memory: "100Mi"
          limits:
            cpu: "1000m"
            memory: "400Mi"
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          readOnly: true
          subPath: filebeat.yml
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varrundockersock
          mountPath: /var/run/docker.sock
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0600
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: varlog
        hostPath:
          path: /var/log
      - name: varrundockersock
        hostPath:
          path: /var/run/docker.sock
      # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
      - name: data
        hostPath:
          # When filebeat runs as non-root user, this directory needs to be writable by group (g+w).
          path: /var/lib/filebeat-data
          type: DirectoryOrCreate

jsoriano · February 22, 2021, 8:24am

How many pods are there running in the cluster?

raulgs · February 22, 2021, 8:43am

Right now we have 97 active pods

jsoriano · February 22, 2021, 8:47am

Thanks for the details! I will try to reproduce the issue.

raulgs · February 22, 2021, 8:48am

Thanks looking forward for your outcome.

jsoriano · February 22, 2021, 8:57am

Sorry, another question regarding this part. When Filebeat is OOM killed, does it actually die and is restarted? or it just keeps running consuming more resources?
Is the restart count of kubernetes increased?

raulgs · February 22, 2021, 9:13am

As far as I remember the restart was counted in kubernetes.
According to the logs of the pods I could see it had restarted.
However it stopped at some point, without dropping any specific error.

jsoriano · February 22, 2021, 9:33am

And after this, filebeat was still affecting the node?

raulgs · February 22, 2021, 10:27am

Yes at this point it looks like shown in my first post.

jsoriano · March 18, 2021, 1:16pm

Quickly tried to reproduce with a kind cluster with multiple nodes, freezing the nodes (instead of suspending), but didn't see anything like this.

system · April 15, 2021, 3:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Filebeat causing a very large iowait and lagging after uncontrolled reboot Beats filebeat	2	374	March 11, 2024
Filebeat consumes a large amount of disk io reads on a Kubernetes node Beats docker , filebeat	7	2016	June 20, 2019
High DiskIO with Proxmox + Ceph Beats filebeat	2	1106	April 23, 2019
Filebeat daemonset losing messages Beats docker , filebeat	2	811	June 22, 2022
Filebeat Pod not working properly Beats docker , filebeat , metricbeat	4	1095	March 10, 2021

Filebeat iowaits issue?

Related topics