Not receiving State_* metrics in Azure/Kubernetes

Hi all,

I'm in the middle of setting up ECK (Elastic Cloud on Kubernetes) on a Kubernetes cluster on Azure, and I've hit a snag with Metricbeat. I'm hoping someone can point me in the right direction.

The setup
I've got Elastic and related deployments running in a dedicated namespace on two dedicated nodes.
Both Elastic and Kibana are up and running without error. The initial deployment was done using the Elastic Operator for Kubernetes, and is secured accordingly.

The problem
I've added Metricbeat to the mix. Metricbeat uses a DaemonSet to collect metrics from the pods, nodes, etc., and a Deployment to collect cluster state statistics. After a bit of trial and error on the security part, output from the DaemonSet pods is working fine, but I am not receiving anything from the state-metrics Deployment.

The Metricbeat pod for the Deployment seems to come up without any error and seems to be collecting metrics.

2019-12-04T08:07:13.195Z	INFO	instance/beat.go:292	Setup Beat: metricbeat; Version: 7.4.2
2019-12-04T08:07:13.196Z	INFO	elasticsearch/client.go:170	Elasticsearch url: https://elastic-es-http:9200
2019-12-04T08:07:13.196Z	INFO	[publisher]	pipeline/module.go:97	Beat name: metricbeat-54f645684f-xn646
2019-12-04T08:07:13.198Z	INFO	[monitoring]	log/log.go:118	Starting metrics logging every 30s
2019-12-04T08:07:13.198Z	INFO	instance/beat.go:422	metricbeat start running.
2019-12-04T08:07:13.198Z	INFO	cfgfile/reload.go:171	Config reloader started
2019-12-04T08:07:13.199Z	INFO	cfgfile/reload.go:226	Loading of config files completed.
2019-12-04T08:07:43.200Z	INFO	[monitoring]	log/log.go:145	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":20,"time":{"ms":27}},"total":{"ticks":100,"time":{"ms":110},"value":100},"user":{"ticks":80,"time":{"ms":83}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":8},"info":{"ephemeral_id":"126e8a4b-6dbd-459e-bfc1-93d1ecdb7b3d","uptime":{"ms":30263}},"memstats":{"gc_next":9569520,"memory_alloc":5648288,"memory_total":15310032,"rss":52699136},"runtime":{"goroutines":30}},"libbeat":{"config":{"module":{"running":0},"reloads":1},"output":{"type":"elasticsearch"},"pipeline":{"clients":0,"events":{"active":0}}},"system":{"cpu":{"cores":2},"load":{"1":0.43,"15":0.75,"5":0.58,"norm":{"1":0.215,"15":0.375,"5":0.29}}}}}}
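
To make that last monitoring line easier to read, here is a minimal Python sketch that pulls out the module and pipeline counters. The function name and the trimmed sample line are mine, purely for illustration. My assumption is that these counters would be non-zero if the kubernetes module were loaded and events were being published.

```python
import json


def summarize_monitoring_line(line: str) -> dict:
    """Extract module and pipeline counters from a Beats 'Non-zero metrics' log line."""
    # The JSON payload starts at the first '{' on the line.
    payload = json.loads(line[line.index("{"):])
    libbeat = payload["monitoring"]["metrics"]["libbeat"]
    return {
        "modules_running": libbeat["config"]["module"]["running"],
        "pipeline_clients": libbeat["pipeline"]["clients"],
        "events_active": libbeat["pipeline"]["events"]["active"],
        "output_type": libbeat["output"]["type"],
    }


# Trimmed copy of the line logged above (only the libbeat section is kept).
sample = (
    "2019-12-04T08:07:43.200Z\tINFO\t[monitoring]\tlog/log.go:145\t"
    "Non-zero metrics in the last 30s\t"
    '{"monitoring": {"metrics": {"libbeat": {'
    '"config": {"module": {"running": 0}, "reloads": 1}, '
    '"output": {"type": "elasticsearch"}, '
    '"pipeline": {"clients": 0, "events": {"active": 0}}}}}}'
)

print(summarize_monitoring_line(sample))
```

For the line above, all three counters come out as 0, with the output type reported as elasticsearch.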

I've restarted the kube-state-metrics pod in the kube-system namespace just to be on the safe side. This too seems to start up without error:

I1203 16:21:25.679571       1 main.go:184] Testing communication with server
I1203 16:21:25.722699       1 main.go:189] Running with Kubernetes cluster version: v1.14. git version: v1.14.8. git tree state: clean. commit: 1da9875156ba0ad48e7d09a5d00e41489507f592. platform: linux/amd64
I1203 16:21:25.722726       1 main.go:191] Communication with server successful
I1203 16:21:25.722915       1 main.go:225] Starting metrics server: 0.0.0.0:8080
I1203 16:21:25.723261       1 main.go:200] Starting kube-state-metrics self metrics server: 0.0.0.0:8081
I1203 16:21:25.723348       1 metrics_handler.go:96] Autosharding disabled
I1203 16:21:25.724509       1 builder.go:144] Active collectors: certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges,namespaces,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses

All the shards on Elastic are status green, and I'm not seeing any connection or index errors (or any errors, for that matter) in the Elastic logs.

In short, all seems fine, except there's no State_* data coming in. In fact, if I delete the Metricbeat DaemonSet, no data comes into the Metricbeat index at all, so nothing seems to be coming from the Deployment pod.

Configurations
The configuration for the Metricbeat Deployment is mostly standard, with added SSL/auth.

apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-deployment-config
  namespace: elastic
  labels:
    k8s-app: metricbeat
data:
  metricbeat.yml: |-
    metricbeat.config.modules:
      # Reload module configs as they change:
      reload.enabled: false
    processors:
      - add_cloud_metadata:
      - add_kubernetes_metadata:
          in_cluster: true
    setup.ilm.enabled: false
    output.elasticsearch:
      hosts: ['https://elastic-es-http:9200']
      ssl.certificate_authorities: ["/usr/share/elastic/certs/ca.crt"]
      ssl.certificate: '/usr/share/elastic/certs/tls.crt'
      ssl.key: '/usr/share/elastic/certs/tls.key'
      username: '{username}'
      password: "{password}"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-deployment-modules
  namespace: elastic
  labels:
    k8s-app: metricbeat
data:
  # This module requires `kube-state-metrics` up and running under `kube-system` namespace
  kubernetes.yml: |-
    - module: kubernetes
      labels.dedot: true
      annotations.dedot: true
      metricsets:
        - state_node
        - state_deployment
        - state_replicaset
        - state_pod
        - state_container
        - state_statefulset
        # Uncomment this to get k8s events:
        - event
      period: 10s
      hosts: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]
      add_metadata: true
      in_cluster: true
      enabled: true
---
# Deploy singleton instance in the whole cluster for some unique data sources, like kube-state-metrics
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: metricbeat
  namespace: elastic
  labels:
    k8s-app: metricbeat
spec:
  template:
    metadata:
      creationTimestamp: ~
      labels:
        k8s-app: metricbeat
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: agentpool
                    operator: In
                    values:
                      - elastic
      containers:
        - args:
            - "-c"
            - /etc/metricbeat.yml
            - "-e"
          image: "docker.elastic.co/beats/metricbeat-oss:7.4.2"
          imagePullPolicy: IfNotPresent
          name: metricbeat
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          securityContext:
            runAsUser: 0
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /usr/share/elastic/certs/
              name: elastic-internal-http-certificates
              readOnly: true
            - mountPath: /etc/metricbeat.yml
              name: config
              readOnly: true
              subPath: metricbeat.yml
            - mountPath: /usr/share/metricbeat/modules.d
              name: modules
              readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: metricbeat
      serviceAccountName: metricbeat
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          key: restriction
          operator: Equal
          value: elastic
      volumes:
        - configMap:
            defaultMode: 384
            name: metricbeat-deployment-config
          name: config
        - name: elastic-internal-http-certificates
          secret:
            defaultMode: 420
            optional: false
            secretName: elastic-es-http-certs-internal
        - configMap:
            defaultMode: 384
            name: metricbeat-deployment-modules
          name: modules

I initially ran the Deployment in the kube-system namespace and the DaemonSet in the elastic namespace. As that didn't work, I've now moved everything to the elastic namespace, and am targeting the kube-state-metrics service in kube-system using "kube-state-metrics.kube-system.svc.cluster.local:8080".
In both cases the effect was the same: no visible errors in any of the logs, but no State_* metrics either.

Does anyone have any ideas what could be going on and/or have any suggestions on what I can do to try and isolate the cause?

Kind regards,
Chris

I did a bit of further digging and can confirm that Kube-State-Metrics is gathering the correct metrics (I can view the telemetry through a port-forward).
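
A check like this could also be scripted. Below is a minimal Python sketch, assuming a port-forward to kube-state-metrics is open on localhost:8080 (the local port is an assumption). It counts samples per resource type in the Prometheus text output. The helper names are hypothetical, and the sample text is made up to show the format, not copied from my cluster.

```python
import urllib.request
from collections import Counter


def count_samples_by_resource(exposition: str) -> Counter:
    """Count samples per resource, i.e. the word following 'kube_' in the metric name."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        parts = name.split("_")
        if parts[0] == "kube" and len(parts) > 1:
            counts[parts[1]] += 1
    return counts


def fetch_metrics(url: str = "http://localhost:8080/metrics", timeout: float = 5.0) -> str:
    """Fetch the raw metrics text; the default URL assumes a local port-forward."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Illustrative sample only; label values here are invented.
sample = """\
# HELP kube_pod_info Information about pod.
# TYPE kube_pod_info gauge
kube_pod_info{namespace="elastic",pod="example-pod"} 1
kube_node_info{node="example-node"} 1
kube_deployment_status_replicas{namespace="elastic",deployment="metricbeat"} 1
"""

print(count_samples_by_resource(sample))
```

With the port-forward running, `count_samples_by_resource(fetch_metrics())` would give the same breakdown for the live endpoint.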

This leads me to believe the issue is that Metricbeat cannot correctly connect to either Kube-State-Metrics or Elastic.

As an experiment, I've changed the 'hosts' settings in the Metricbeat Deployment config for both Elasticsearch and kube-state-metrics to non-existent endpoints, to see if that would throw any useful errors.

I have tried (in separate steps):

  • Invalid host for Kube-State-Metrics
  • Invalid host for Elastic
  • Invalid username/password for Elastic
  • Invalid certificate for Elastic

In all cases (including with 'Debug' logging activated), no errors are shown at all. The Metricbeat logs will simply show the 'Non-zero metrics' INFO messages.
Seemingly it does not matter whether I provide valid or invalid hosts or credentials.

  1. This strikes me as unexpected behavior. Can someone confirm that under normal circumstances, connection errors should be shown?
  2. Are there further steps I can take to debug/isolate the problem?
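
For question 2, one thing that could be tried is a plain TCP probe of both endpoints from a pod in the same namespace. Here is a minimal Python sketch, assuming a pod with Python available (the Metricbeat image may not have it). The hostnames and ports are the ones from the config above, and the helper names are hypothetical. It only checks that a TCP connection can be opened, not TLS or authentication.

```python
import socket

# Endpoints taken from the Metricbeat config in this post.
ENDPOINTS = [
    ("kube-state-metrics.kube-system.svc.cluster.local", 8080),
    ("elastic-es-http", 9200),
]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def probe_all(endpoints):
    """Map 'host:port' to the result of can_connect for each endpoint."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in endpoints}


# Inside the cluster: print(probe_all(ENDPOINTS))
```

If both probes succeed from inside the cluster, that would suggest the network path is fine and the problem lies in the Metricbeat configuration itself.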

(Note: Elastic v7.5, Metricbeat v7.5)