Can't collect Kubernetes metrics with elastic-agent

Hello,

A colleague and I have spent days trying to configure elastic-agent to collect Kubernetes metrics. Let me explain our stack first:

Before:

  • version of all elastic components: 8.11.4
  • kube-state-metrics properly deployed
  • environment: air-gapped, no internet access to the outside
  • elastic deployed on premise
  • kubernetes cluster deployed and pods working fine
  • Everything was working fine before: Elastic Agent collected system metrics and logs from the VMs, and Metricbeat (deployed via ECK) collected the Kubernetes metrics.

After:

We had to replace the Metricbeat deployment with an Elastic Agent using the Kubernetes integration, in addition to the Elastic Agents already running on the VMs to collect system metrics. I don't know whether it's correct to have two agents running (one for system metrics, one for Kubernetes), or whether the Elastic Agent on Kubernetes is enough to gather both system and Kubernetes metrics.

The problem is that we get many errors as soon as the elastic-agent starts, for example:

{"log.level":"error","@timestamp":"2024-08-16T13:59:57.201Z","message":"Error
dialing x509: certificate signed by unknown
authority","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"kubernetes/metrics-default","type":"kubernetes/metrics"},"log":{"source":"kubernetes/metrics-default"},"log.logger":"esclientleg","log.origin":{"file.line":38,"file.name":"transport/logging.go"},"service.name":"metricbeat","network":"tcp","address":"m090-sup-prd1:9200","ecs.version":"1.6.0","ecs.version":"1.6.0"}

Sometimes we also get a DNS error: lookup m090-sup-prd1: no such host. The weird part is that when Metricbeat was deployed, it was able to reach the Elasticsearch host m090-sup-prd1 just fine.
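
If we read the x509 error correctly, the agent's Elasticsearch output does not trust the CA that signed our on-prem cluster's certificate. A minimal sketch of the output setting we believe is involved (assumption: this goes into Fleet > Settings > Outputs > Advanced YAML configuration of the Elasticsearch output, and the CA file has to exist inside the agent container, hence the mount attempt shown further below):

# Assumption: advanced YAML of the Fleet-managed Elasticsearch output;
# the path is where the CA is mounted inside the elastic-agent container
ssl.certificate_authorities:
  - /etc/ssl/certs/elastic-ca.crt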

Below is the elastic-agent configuration:

---
# For more information https://www.elastic.co/guide/en/fleet/current/running-on-kubernetes-managed-by-fleet.html
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
    app: elastic-agent
spec:
  selector:
    matchLabels:
      app: elastic-agent
  template:
    metadata:
      labels:
        app: elastic-agent
    spec:
      # Tolerations are needed to run Elastic Agent on Kubernetes control-plane nodes.
      # Agents running on control-plane nodes collect metrics from the control plane components (scheduler, controller manager) of Kubernetes
      # tolerations:
      #   - key: node-role.kubernetes.io/control-plane
      #     effect: NoSchedule
      #   - key: node-role.kubernetes.io/master
      #     effect: NoSchedule
      serviceAccountName: elastic-agent
      hostNetwork: true
      # 'hostPID: true' enables the Elastic Security integration to observe all process exec events on the host.
      # Sharing the host process ID namespace gives visibility of all processes running on the same host.
      hostPID: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: elastic-agent
          image: nexus.forge-dc.cloudmi.minint.fr/rrf/elk/agent/elasticagent:8.11.4
          env:
            # Set to 1 for enrollment into Fleet server. If not set, Elastic Agent is run in standalone mode
            - name: FLEET_ENROLL
              value: "1"
            # Set to true to communicate with Fleet with either insecure HTTP or unverified HTTPS
            - name: FLEET_INSECURE
              value: "true"
            # Fleet Server URL to enroll the Elastic Agent into
            # FLEET_URL can be found in Kibana, go to Management > Fleet > Settings
            - name: FLEET_URL
              value: "https://m090-sup-prd1:8220"
            # Elasticsearch API key used to enroll Elastic Agents in Fleet (https://www.elastic.co/guide/en/fleet/current/fleet-enrollment-tokens.html#fleet-enrollment-tokens)
            # If FLEET_ENROLLMENT_TOKEN is empty then KIBANA_HOST, KIBANA_FLEET_USERNAME, KIBANA_FLEET_PASSWORD are needed
            - name: FLEET_ENROLLMENT_TOKEN
              value: "RHVFVVVKRUJlQjRpdUo3V3U3Zl86emJLODdVWHJRZnFsczM1QXdRQURCdw=="
            - name: KIBANA_HOST
              value: "http://kibana:5601"
            # The basic authentication username used to connect to Kibana and retrieve a service_token to enable Fleet
            - name: KIBANA_FLEET_USERNAME
              value: "elastic"
            # The basic authentication password used to connect to Kibana and retrieve a service_token to enable Fleet
            - name: KIBANA_FLEET_PASSWORD
              value: "changeme"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            # The following ELASTIC_NETINFO:false variable will disable the netinfo.enabled option of add-host-metadata processor. This will remove fields host.ip and host.mac.
            # For more info: https://www.elastic.co/guide/en/beats/metricbeat/current/add-host-metadata.html
            - name: ELASTIC_NETINFO
              value: "false"
          securityContext:
            runAsUser: 0
            # The following capabilities are needed for 'Defend for containers' integration (cloud-defend)
            # If you are using this integration, please uncomment these lines before applying.
            #capabilities:
            #  add:
            #    - BPF # (since Linux 5.8) allows loading of BPF programs, create most map types, load BTF, iterate programs and maps.
            #    - PERFMON # (since Linux 5.8) allows attaching of BPF programs used for performance metrics and observability operations.
            #    - SYS_RESOURCE # Allow use of special resources or raising of resource limits. Used by 'Defend for Containers' to modify 'rlimit_memlock'
            ########################################################################################
            # The following capabilities are needed for Universal Profiling.
            # More fine graded capabilities are only available for newer Linux kernels.
            # If you are using the Universal Profiling integration, please uncomment these lines before applying.
            #procMount: "Unmasked"
            #privileged: true
            #capabilities:
            #  add:
            #    - SYS_ADMIN
          resources:
            limits:
              memory: 700Mi
            requests:
              cpu: 100m
              memory: 400Mi
          volumeMounts:
            - name: proc
              mountPath: /hostfs/proc
              readOnly: true
            - name: cgroup
              mountPath: /hostfs/sys/fs/cgroup
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: etc-full
              mountPath: /hostfs/etc
              readOnly: true
            - name: var-lib
              mountPath: /hostfs/var/lib
              readOnly: true
            - name: etc-mid
              mountPath: /etc/machine-id
              readOnly: true
            - name: sys-kernel-debug
              mountPath: /sys/kernel/debug
            - name: elastic-agent-state
              mountPath: /usr/share/elastic-agent/state
            # If you are using the Universal Profiling integration, please uncomment these lines before applying.
            #- name: universal-profiling-cache
            #  mountPath: /var/cache/Elastic
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: cgroup
          hostPath:
            path: /sys/fs/cgroup
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: varlog
          hostPath:
            path: /var/log
        # The following volumes are needed for Cloud Security Posture integration (cloudbeat)
        # If you are not using this integration, then these volumes and the corresponding
        # mounts can be removed.
        - name: etc-full
          hostPath:
            path: /etc
        - name: var-lib
          hostPath:
            path: /var/lib
        # Mount /etc/machine-id from the host to determine host ID
        # Needed for Elastic Security integration
        - name: etc-mid
          hostPath:
            path: /etc/machine-id
            type: File
        # Needed for 'Defend for containers' integration (cloud-defend) and Universal Profiling
        # If you are not using one of these integrations, then these volumes and the corresponding
        # mounts can be removed.
        - name: sys-kernel-debug
          hostPath:
            path: /sys/kernel/debug
        # Mount /var/lib/elastic-agent-managed/kube-system/state to store elastic-agent state
        # Update 'kube-system' with the namespace of your agent installation
        - name: elastic-agent-state
          hostPath:
            path: /var/lib/elastic-agent-managed/kube-system/state
            type: DirectoryOrCreate
        # Mount required for Universal Profiling.
        # If you are using the Universal Profiling integration, please uncomment these lines before applying.
        #- name: universal-profiling-cache
        #  hostPath:
        #    path: /var/cache/Elastic
        #    type: DirectoryOrCreate
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: elastic-agent
subjects:
  - kind: ServiceAccount
    name: elastic-agent
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: kube-system
  name: elastic-agent
subjects:
  - kind: ServiceAccount
    name: elastic-agent
    namespace: kube-system
roleRef:
  kind: Role
  name: elastic-agent
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: elastic-agent-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: elastic-agent
    namespace: kube-system
roleRef:
  kind: Role
  name: elastic-agent-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: elastic-agent
  labels:
    k8s-app: elastic-agent
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - namespaces
      - events
      - pods
      - services
      - configmaps
      # Needed for cloudbeat
      - serviceaccounts
      - persistentvolumes
      - persistentvolumeclaims
    verbs: ["get", "list", "watch"]
  # Enable this rule only if planing to use kubernetes_secrets provider
  #- apiGroups: [""]
  #  resources:
  #  - secrets
  #  verbs: ["get"]
  - apiGroups: ["extensions"]
    resources:
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - statefulsets
      - deployments
      - replicasets
      - daemonsets
    verbs: ["get", "list", "watch"]
  - apiGroups:
      - ""
    resources:
      - nodes/stats
    verbs:
      - get
  - apiGroups: [ "batch" ]
    resources:
      - jobs
      - cronjobs
    verbs: [ "get", "list", "watch" ]
  # Needed for apiserver
  - nonResourceURLs:
      - "/metrics"
    verbs:
      - get
  # Needed for cloudbeat
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources:
      - clusterrolebindings
      - clusterroles
      - rolebindings
      - roles
    verbs: ["get", "list", "watch"]
  # Needed for cloudbeat
  - apiGroups: ["policy"]
    resources:
      - podsecuritypolicies
    verbs: ["get", "list", "watch"]
  - apiGroups: [ "storage.k8s.io" ]
    resources:
      - storageclasses
    verbs: [ "get", "list", "watch" ]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: elastic-agent
  # Should be the namespace where elastic-agent is running
  namespace: kube-system
  labels:
    k8s-app: elastic-agent
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: elastic-agent-kubeadm-config
  namespace: kube-system
  labels:
    k8s-app: elastic-agent
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    resourceNames:
      - kubeadm-config
    verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-agent
  namespace: kube-system
  labels:
    k8s-app: elastic-agent
---

We also tried to mount the CA as described in this link, but the elastic-agent pods get stuck in ContainerCreating, and kubectl get events -n kube-system shows the message: cannot mount volumes...:

# ...
          volumeMounts:
            ...
            - name: elastic-ca
              mountPath: /etc/ssl/certs/elastic-ca.crt
              subPath: elastic-ca.crt
              readOnly: true
      volumes:
        ...
        - name: elastic-ca
          secret:
            secretName: elastic-ca
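
For reference, this is roughly what we expect the elastic-ca Secret itself to look like (a sketch with a placeholder certificate, not the real one). The Secret must exist in the same namespace as the DaemonSet (kube-system here); if it doesn't, the pod stays in ContainerCreating with exactly this kind of mount error:

apiVersion: v1
kind: Secret
metadata:
  # Must match 'secretName: elastic-ca' in the volume above, in the DaemonSet's namespace
  name: elastic-ca
  namespace: kube-system
type: Opaque
stringData:
  # Placeholder: the PEM of the CA that signed the Elasticsearch certificate goes here
  elastic-ca.crt: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----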

We are lost. If someone can help, that would be awesome!

We finally solved it thanks to this: Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile · Issue #3586 · elastic/elastic-agent · GitHub

The changes we made weren't being picked up because Elastic Agent stores its state on the host filesystem and does not update it there. The solution, as suggested in the GitHub issue, was to delete the contents of this path: rm -rf /var/lib/elastic-agent-managed/monitoring/state/*. After that, applying the YAML brings in the new changes, and the elastic-agent starts properly!
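
One note for anyone hitting the same thing (an untested assumption on our part): the stale state only survives re-deploys because the state volume is a hostPath. Backing it with an emptyDir instead would mean no state persists across pod recreation, at the cost of the agent re-enrolling every time:

        # Untested sketch: swap the hostPath-backed state volume for an emptyDir so no stale
        # enrollment state survives pod recreation (trade-off: re-enrollment on every restart)
        - name: elastic-agent-state
          emptyDir: {}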