Filebeat autodiscover mode is flooding my kubernetes API

Hello,

I'm working on a managed Kubernetes cluster from a cloud provider that offers limited K8S API performance (slow master nodes).

I have installed filebeat 8.5.1 with the official Helm chart.

The provider's admins say that my filebeat configuration (deployed inside the cluster) relies too heavily on K8S API calls: it is flooding the API servers and their admin tasks have response-time problems, even though there are only 3 worker nodes and relatively few logs (the app is not yet in production).

I have already lowered the pressure with max_procs=1 and kube_client_options.qps=1, but this is not enough.

I don't really know how to go further, so I have a few questions:

  • are there any design errors in the filebeat.yml config below?
  • do I risk losing messages with kube_client_options.qps=1, or only the additional kubernetes fields (which I don't care that much about)? Will there be a log entry highlighting this problem if it happens?
  • is autodiscover mandatory to achieve the filtering I'm trying to get? If not, what is the legacy alternative? (I've sketched what I imagine it looks like right after this list.)
  • are there better architecture choices that could lower K8S API pressure, for example using logstash and filtering there?
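
For the third bullet, here is roughly what I imagine the non-autodiscover alternative would look like: a plain container input plus add_kubernetes_metadata, with the filtering moved to a drop_event processor. This is only a sketch based on my reading of the docs (it covers just the 'server'/'teleport' case, and the multiline/dissect settings would move in as well), so please tell me if it is wrong or would not actually reduce the API calls:

    filebeat.inputs:
      - type: container
        paths:
          - /var/log/containers/*.log
        processors:
          - add_kubernetes_metadata:
              host: ${NODE_NAME}
              matchers:
                - logs_path:
                    logs_path: "/var/log/containers/"
          # my guess at the equivalent of the first autodiscover template:
          # keep only the containers I care about, drop everything else
          - drop_event:
              when:
                not:
                  or:
                    - equals:
                        kubernetes.container.name: server
                    - equals:
                        kubernetes.container.name: teleport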

Many thanks in advance for any input :pray:

    max_procs: 1 # default = virtual CPU count (8 in production, too much for OVH API servers)
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          kube_client_options:
            qps: 1
            burst: 10
          templates:
            - condition:
                or:
                  - equals.kubernetes.container.name: server # all produced myapp containers
                  - equals.kubernetes.container.name: teleport # same message start
              config:
                - type: container
                  paths:
                    - /var/log/containers/*-${data.kubernetes.container.id}.log
                  multiline:
                    pattern: "^[0-9]{4}-[0-9]{2}-[0-9]{2}" # starts with our date pattern
                    negate: true
                    match: after
                  processors:
                    - add_kubernetes_metadata:
                        host: ${NODE_NAME}
                        matchers:
                          - logs_path:
                              logs_path: "/var/log/containers/"
                    - dissect:
                        # searchable fields are defined here: https://www.elastic.co/guide/en/ecs/8.7/ecs-field-reference.html
                        # 2023-04-13T09:44:59.013Z  INFO user[1f4ba76d-c76f-4d4f-bd91-453b5313708d] 1 --- [io-8080-exec-10] c.p.b.m.t.a.LyraMessageBuilder: Start LyraMessageBuilder.build(..)
                        tokenizer: "%{event.start} %{log.level} user[%{process.real_user.id}] %{log.syslog.msgid} --- [%{log.syslog.procid}] %{log.origin.function}: %{event.reason}"
                        field: "message"
                        target_prefix: ""
                        ignore_failure: true
                        overwrite_keys: true
                        trim_values: "all" # trim leading and trailing whitespace from values
            - condition: # other non my-app pods
                and:
                  - not.equals.kubernetes.container.name: apm-server
                  - not.equals.kubernetes.container.name: autoscaler # kube-dns-autoscaler
                  - not.equals.kubernetes.container.name: aws-cluster-autoscaler
                  - not.equals.kubernetes.container.name: calico-node
                  - not.equals.kubernetes.container.name: cert-manager-webhook-ovh
                  - not.equals.kubernetes.container.name: coredns
                  - not.equals.kubernetes.container.name: csi-snapshotter
                  - not.equals.kubernetes.container.name: filebeat
                  - not.equals.kubernetes.container.name: ingress-nginx-default-backend
                  - not.equals.kubernetes.container.name: logstash
                  - not.equals.kubernetes.container.name: metricbeat
                  - not.equals.kubernetes.container.name: pgadmin4
                  - not.equals.kubernetes.container.name: server # <-- above condition
                  - not.equals.kubernetes.container.name: teleport # <-- above condition
                  - not.equals.kubernetes.container.name: wormhole
              config:
                - type: container
                  paths:
                    - /var/log/containers/*-${data.kubernetes.container.id}.log
                  processors:
                    - add_kubernetes_metadata:
                        host: ${NODE_NAME}
                        matchers:
                          - logs_path:
                              logs_path: "/var/log/containers/"

Thank you for your time.

This is a 3-worker-node cluster with 15 GB RAM per node, so a very modest cluster.

Most of these container names match one pod per node. The 'server' name matches more, around 5 pods per node at most.

Another question on top of the others: can I measure the K8S API pressure in some way, for example by raising the log level?

Yeah, measuring API usage is possible. I have an API server metrics dashboard in Grafana that came with kube-prometheus-stack (helm-charts/apiserver.yaml at main · prometheus-community/helm-charts · GitHub), and metricbeat has an apiserver metricset in its kubernetes module that I haven't evaluated before: Kubernetes module | Metricbeat Reference [8.7] | Elastic.
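
If you go the metricbeat route, the config I'd expect is something along these lines; I haven't run it myself, so treat it as a sketch and adjust the service-account/TLS paths to your cluster:

    metricbeat.modules:
      - module: kubernetes
        metricsets: ["apiserver"]
        # talk to the API server through the in-cluster service endpoint
        hosts: ["https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"]
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        ssl.certificate_authorities:
          - /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        period: 30s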

Thanks Brian, I'll look into that. It will take some time, and I'm confident it will show me the actual pressure, but it won't solve my problem, since the admins told me specifically that 98% of the calls come from the filebeat pods.

Back to my initial questions: what could be improved to lower the Kubernetes API pressure?

Would the scope field help? The documentation says its values are node or cluster.

Is it cluster by default, even though resource is pod by default?
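
Concretely, I'm wondering whether setting it explicitly on the provider, something like this, would reduce the number of watches (just a guess from the docs, not tested yet):

    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME} # not sure if this is needed on top of scope
          scope: node        # only watch pods on this node instead of the whole cluster (my reading of the docs)
          kube_client_options:
            qps: 1
            burst: 10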
