Heartbeat autodiscovery generates a second monitor which is always down

Hi,

since we updated Heartbeat in our Kubernetes cluster from version 7.17.x to 8.3.3, we have had a second monitor for every pod we have annotated.

Both monitors have the same name and the same tags, but one has no URL and shows the following error in its details:

io:job could not be initialized: hosts is a mandatory parameter accessing config

If I delete the pod, both monitors vanish as they should, and if I start it again, both are back.

Our config looks like this:

cloud.id: ${ELASTIC_CLOUD_ID}
cloud.auth: ${ELASTIC_CLOUD_AUTH}

logging:
  level: ERROR
  to_stderr: true

heartbeat.autodiscover:
  # Autodiscover pods
  providers:
    - type: kubernetes
      resource: pod
      scope: cluster
      node: ${NODE_NAME}
      cleanup_timeout: 120
      hints.enabled: true

The following annotations are set on the pods (names and tags changed for security reasons):

co.elastic.monitor/hosts: app.namespace.svc.cluster.local/api/health
co.elastic.monitor/name: app-name
co.elastic.monitor/schedule: '@every 30s'
co.elastic.monitor/tags.0: namespace
co.elastic.monitor/tags.1: app-name
co.elastic.monitor/timeout: 30s
co.elastic.monitor/type: http

Not really sure what we are doing wrong, as it worked for a year without problems.

Hi @kintetsu,

I'm sorry to hear that you're having problems. Would you be able to provide an example configuration of the pod being deployed? It would go a long way in helping us find the issue.

Please remember to remove any sensitive information before sharing it.

Hi @emilioalvap

here is the Pod configuration:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cattle.io/timestamp: "2022-09-30T15:15:29Z"
    co.elastic.logs.<service-name>-svc/enabled: "true"
    co.elastic.logs.<service-name>-svc/json.add_error_key: "true"
    co.elastic.logs.<service-name>-svc/json.keys_under_root: "true"
    co.elastic.logs.<service-name>-svc/json.message_key: message
    co.elastic.monitor/hosts: <service-name>.<namespace>.svc.cluster.local/health
    co.elastic.monitor/name: <service-name>
    co.elastic.monitor/schedule: '@every 30s'
    co.elastic.monitor/tags.0: <namespace>
    co.elastic.monitor/tags.1: <service-name>
    co.elastic.monitor/timeout: 30s
    co.elastic.monitor/type: http
    traffic.sidecar.istio.io/excludeOutboundPorts: "3306"
  creationTimestamp: "2022-10-03T10:06:39Z"
  generateName: <service-name>-8cd6f99d8-
  labels:
    app: <service-name>
    pod-template-hash: 8cd6f99d8
    version: v1
  name: <service-name>-8cd6f99d8-njpxn
  namespace: <namespace>
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: <service-name>-8cd6f99d8
    uid: 0a5857f3-9b5d-416f-a494-a7385920a95e
  resourceVersion: "82695198"
  uid: cff0442d-95ad-4867-8c8a-b03e82889e31
spec:
  containers:
  - env:
    - name: TZ
      value: UTC
    - name: ELASTIC_APM_SERVICE_NAME
      value: <service-name>
    - name: ELASTIC_APM_SERVER_URL
      valueFrom:
        secretKeyRef:
          key: serverUrl
          name: elastic-apm
    - name: ELASTIC_APM_SECRET_TOKEN
      valueFrom:
        secretKeyRef:
          key: secretToken
          name: elastic-apm
    - name: ELASTIC_APM_APPLICATION_PACKAGES
      value: <package>
    - name: ELASTIC_APM_SERVICE_VERSION
      value: latest
    - name: ELASTIC_APM_HOSTNAME
      value: <hostname>
    - name: ELASTIC_APM_ENVIRONMENT
      value: <environment>
    image: <image>:latest
    imagePullPolicy: Always
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8081
        scheme: HTTP
      initialDelaySeconds: 20
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 1
    name: <service-name>
    ports:
    - containerPort: 8081
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8081
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      limits:
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 1Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: <name>
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: <secret>
  nodeName: <node>
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: <name>
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

And here is the Service configuration:

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: <name>
    meta.helm.sh/release-namespace: <namespace>
  creationTimestamp: "2022-05-16T11:55:14Z"
  labels:
    app: <service-name>
    app.kubernetes.io/managed-by: Helm
    version: v1
  name: <service-name>
  namespace: <namespace>
  resourceVersion: "9322407"
  uid: 42c7846f-1cc5-4a2b-bb64-c67cf848bdcf
spec:
  clusterIP: <x.x.x.x>
  clusterIPs:
  - <x.x.x.x>
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8081
  selector:
    app: <service-name>
  sessionAffinity: None
  type: ClusterIP

And here is the latest Heartbeat configuration we are using. We also tried removing the autodiscover provider for the service resource, with no effect, but when we remove the provider for pods, both monitors are gone. So we are fairly sure the problem is not caused by having it configured for both services and pods.

cloud.id: ${ELASTIC_CLOUD_ID}
cloud.auth: ${ELASTIC_CLOUD_AUTH}

logging:
  level: ERROR
  to_stderr: true

heartbeat.autodiscover:
  # Autodiscover pods
  providers:
    - type: kubernetes
      resource: pod
      scope: cluster
      node: ${NODE_NAME}
      cleanup_timeout: 120
      hints.enabled: true
    - type: kubernetes
      resource: service
      scope: cluster
      node: ${NODE_NAME}
      cleanup_timeout: 120
      hints.enabled: true
    - type: kubernetes
      resource: node
      node: ${NODE_NAME}
      scope: cluster
      templates:
        # Example, check SSH port of all cluster nodes:
        - condition: ~
          config:
            - hosts:
                - ${data.host}:22
              name: ${data.kubernetes.node.name}
              schedule: '@every 10s'
              timeout: 5s
              type: tcp

processors:
  - add_kubernetes_metadata:
  - add_cloud_metadata:

Hi @kintetsu,

This is just a hypothesis, since I have no experience setting up Istio, but from the documentation it seems their plugin can automatically inject a sidecar container into the pod:

traffic.sidecar.istio.io/excludeOutboundPorts: "3306"

If that's the case, unless you specifically filter it out, it will probably be picked up by autodiscovery.

Would you be able to check if that's the case by running describe pod on the deployed environment? The pod should only have one running container, according to the configuration you've provided.

Hi @emilioalvap,

we are no longer using Istio; I removed the annotation, but nothing changed.
I also checked with describe pod, but only one container was running in the pod. No sidecars.

Can we provide any other logs? Or is it possible to get more information from Heartbeat itself about why it creates the second monitor?

@emilioalvap I also set the log level to debug and compared the publish events for both monitors.
They were published only 1 millisecond apart.

The event for the failing monitor had a container object under kubernetes, plus error and container objects at the top level. The event for the succeeding monitor didn't have those.

Apart from these differences they were identical. Under kubernetes.pod they also referenced the exact same pod.

If it is helpful I can remove all sensitive data and provide it as well.

Hi @kintetsu,

Sure, any information you can give us would certainly be helpful.

Hi @emilioalvap

Here is the event from the Heartbeat logs showing the status down:

{
    "@timestamp": "2022-10-05T14:45:15.214Z",
    "@metadata": {
        "beat": "heartbeat",
        "type": "_doc",
        "version": "8.3.3"
    },
    "summary": {
        "up": 0,
        "down": 1
    },
    "tags": [
        "<tag0>",
        "<tag1>"
    ],
    "kubernetes": {
        "namespace": "<namespace>",
        "replicaset": {
            "name": "<service-name>-svc-5c7b8859c9"
        },
        "labels": {
            "version": "v1",
            "app": "<service-name>",
            "pod-template-hash": "5c7b8859c9"
        },
        "node": {
            "labels": {
                "beta_kubernetes_io/os": "linux",
                "kubernetes_io/arch": "amd64",
                "kubernetes_io/hostname": "<hostname>",
                "kubernetes_io/os": "linux",
                "node_kubernetes_io/instance-type": "k1.4x32",
                "beta_kubernetes_io/arch": "amd64",
                "beta_kubernetes_io/instance-type": "k1.4x32"
            },
            "hostname": "<hostname>",
            "name": "<hostname>",
            "uid": "<uuid>"
        },
        "namespace_uid": "<uuid>",
        "namespace_labels": {
            "field_cattle_io/projectId": "<project-id>",
            "kubernetes_io/metadata_name": "<namespace>"
        },
        "pod": {
            "ip": "<ip>",
            "name": "<service-name>-svc-5c7b8859c9-hsbqh",
            "uid": "48376104-987c-4ac2-952e-8dec9eef3bec"
        },
        "container": {
            "name": "<service-name>"
        }
    },
    "event": {
        "dataset": "http"
    },
    "cloud": {
        "availability_zone": "nova",
        "service": {
            "name": "ECS"
        },
        "provider": "huawei",
        "instance": {
            "id": "<uuid>"
        },
        "region": ""
    },
    "agent": {
        "ephemeral_id": "296eac54-671f-4269-8a83-be6476d8d67f",
        "id": "2e18b8f5-8d89-4ed0-bf71-844216f972f7",
        "name": "elastic-heartbeat-5ffd89f9bf-r6wht",
        "type": "heartbeat",
        "version": "8.3.3"
    },
    "observer": {
        "ip": [
            "<ipV4>",
            "<ipV6>"
        ],
        "mac": [
            "<mac>"
        ],
        "geo": {
            "region_name": "<region>",
            "region_iso_code": "<isocode>",
            "city_name": "<city>",
            "name": "<node-name>",
            "location": "<coordinates>",
            "continent_name": "<continent>",
            "country_iso_code": "<isocode>"
        },
        "hostname": "elastic-heartbeat-5ffd89f9bf-r6wht"
    },
    "monitor": {
        "timespan": {
            "gte": "2022-10-05T14:45:15.214Z",
            "lt": "2022-10-05T14:45:45.214Z"
        },
        "id": "auto-http-0X1322FD3945CDC8F5",
        "name": "<service-name>",
        "type": "http",
        "status": "down",
        "duration": {
            "us": 12
        },
        "check_group": "52ce46cd-44bc-11ed-ba01-fa5d5bf6d255"
    },
    "ecs": {
        "version": "8.0.0"
    },
    "error": {
        "type": "io",
        "message": "job could not be initialized: hosts is a mandatory parameter accessing config"
    },
    "container": {
        "id": "a0788016fda3174288fc9b8a9f459890a4a4e79386ddc853a7414480eab6b417",
        "runtime": "containerd",
        "image": {
            "name": "<repository>/<image>:latest"
        }
    }
}

And here is the event showing the status up:

{
    "@timestamp": "2022-10-05T14:45:15.213Z",
    "@metadata": {
        "beat": "heartbeat",
        "type": "_doc",
        "version": "8.3.3"
    },
    "summary": {
        "up": 1,
        "down": 0
    },
    "tcp": {
        "rtt": {
            "connect": {
                "us": 345
            }
        }
    },
    "http": {
        "response": {
            "status_code": 200,
            "body": {
                "hash": "<bodyhash>",
                "bytes": 15
            },
            "mime_type": "application/json",
            "headers": {
                "Content-Type": "application/json",
                "Content-Length": "15",
                "Date": "Wed, 5 Oct 2022 14:45:15 GMT"
            }
        },
        "rtt": {
            "content": {
                "us": 126
            },
            "total": {
                "us": 2439
            },
            "write_request": {
                "us": 53
            },
            "response_header": {
                "us": 1909
            },
            "validate": {
                "us": 2035
            }
        }
    },
    "tags": [
        "<tag0>",
        "<tag1>"
    ],
    "kubernetes": {
        "namespace": "<namespace>",
        "replicaset": {
            "name": "<service-name>-svc-5c7b8859c9"
        },
        "labels": {
            "version": "v1",
            "app": "<service-name>",
            "pod-template-hash": "5c7b8859c9"
        },
        "node": {
            "labels": {
                "beta_kubernetes_io/os": "linux",
                "kubernetes_io/arch": "amd64",
                "kubernetes_io/hostname": "<hostname>",
                "kubernetes_io/os": "linux",
                "node_kubernetes_io/instance-type": "k1.4x32",
                "beta_kubernetes_io/arch": "amd64",
                "beta_kubernetes_io/instance-type": "k1.4x32"
            },
            "hostname": "<hostname>",
            "name": "<hostname>",
            "uid": "<uuid>"
        },
        "namespace_uid": "<uuid>",
        "namespace_labels": {
            "field_cattle_io/projectId": "<project-id>",
            "kubernetes_io/metadata_name": "<namespace>"
        },
        "pod": {
            "ip": "<ip>",
            "name": "<service-name>-svc-5c7b8859c9-hsbqh",
            "uid": "48376104-987c-4ac2-952e-8dec9eef3bec"
        }
    },
    "event": {
        "dataset": "http"
    },
    "resolve": {
        "ip": "<ip>",
        "rtt": {
            "us": 4650
        }
    },
    "url": {
        "scheme": "http",
        "domain": "<service-name>.<namespace>.svc.cluster.local",
        "port": 80,
        "path": "/health",
        "full": "http://<service-name>.<namespace>.svc.cluster.local/health"
    },
    "cloud": {
        "availability_zone": "nova",
        "service": {
            "name": "ECS"
        },
        "provider": "huawei",
        "instance": {
            "id": "<uuid>"
        },
        "region": ""
    },
    "agent": {
        "ephemeral_id": "296eac54-671f-4269-8a83-be6476d8d67f",
        "id": "2e18b8f5-8d89-4ed0-bf71-844216f972f7",
        "name": "elastic-heartbeat-5ffd89f9bf-r6wht",
        "type": "heartbeat",
        "version": "8.3.3"
    },
    "observer": {
        "ip": [
            "<ipV4>",
            "<ipV6>"
        ],
        "mac": [
            "<mac>"
        ],
        "geo": {
            "region_name": "<region>",
            "region_iso_code": "<isocode>",
            "city_name": "<city>",
            "name": "<node-name>",
            "location": "<coordinates>",
            "continent_name": "<continent>",
            "country_iso_code": "<isocode>"
        },
        "hostname": "elastic-heartbeat-5ffd89f9bf-r6wht"
    },
    "monitor": {
        "timespan": {
            "gte": "2022-10-05T14:45:15.220Z",
            "lt": "2022-10-05T14:45:45.220Z"
        },
        "id": "auto-http-0X7603BE60DBA32742",
        "name": "<service-name>",
        "type": "http",
        "status": "up",
        "duration": {
            "us": 7157
        },
        "check_group": "52ce07ed-44bc-11ed-ba01-fa5d5bf6d255",
        "ip": "<ip>"
    },
    "ecs": {
        "version": "8.0.0"
    }
}

Please let me know if I can provide more.

Hi @kintetsu,

Thanks for the data you've provided so far. I haven't been able to reproduce the issue locally yet, so I'd like to ask for some debug logs.

If you are able to enable logging in Heartbeat, it would be helpful to see the autodiscovery events being processed. You can enable these logs through the following heartbeat.yml settings:

logging.level: debug
logging.selectors: ["autodiscover.pod"]

We are interested in the "Watcher Pod add" and "Generated builder event map" events; please remember to remove any sensitive data from these as well. At the same time, I'd encourage you to look out for inconsistencies during this process: the problem could be reflected in some of the data you're filtering out.
If you could also temporarily remove the service and node providers while extracting the logs, it would considerably reduce logging noise.

Hi @emilioalvap,

I have configured the logging, and for my service I can see one event with "Watcher Pod add" and two events with "Generated builder event map". One starts with "hints" and the other with "container", but the latter then also contains the hints.

Is there anything in particular I can look out for? I'm not sure it makes sense to remove the sensitive data and post it here, because I would have to remove so much information.

I also saw two events with "Generated hints": one with "logs" and "monitor", and one with only "monitor". We also use autodiscovery for Filebeat, and we also use a Java APM agent in the container.

I also got this event. Please note the %s in front of the endpoint, which seems strange to me.

{
    "log.level": "warn",
    "@timestamp": "2022-10-07T12:47:04.849Z",
    "log.origin": {
        "file.name": "hints/monitors.go",
        "file.line": 151
    },
    "message": "unable to frame a host from input host: %s<service-name>.<namespace>.svc.cluster.local/health",
    "service.name": "heartbeat",
    "ecs.version": "1.6.0"
}

And again I found the events for the two monitors.

{
    "log.level": "info",
    "@timestamp": "2022-10-07T12:47:04.856Z",
    "log.origin": {
        "file.name": "logger/logger.go",
        "file.line": 97
    },
    "message": "Monitor finished",
    "service.name": "heartbeat",
    "event": {
        "action": "monitor.run"
    },
    "monitor": {
        "id": "auto-http-0X1322FD3945CDC8F5",
        "type": "http",
        "duration": {
            "ms": 0
        }
    },
    "ecs.version": "1.6.0"
}
{
    "log.level": "info",
    "@timestamp": "2022-10-07T12:47:04.858Z",
    "log.origin": {
        "file.name": "logger/logger.go",
        "file.line": 97
    },
    "message": "Monitor finished",
    "service.name": "heartbeat",
    "event": {
        "action": "monitor.run"
    },
    "monitor": {
        "id": "auto-http-0X7603BE60DBA32742",
        "type": "http",
        "duration": {
            "ms": 4
        }
    },
    "ecs.version": "1.6.0"
}

Hi @kintetsu,

Thanks so much for providing the requested information! It was enough for me to find the issue and replicate it locally.
Have you perhaps recently added the ports section of the pod config?

ports:
    - containerPort: 8081
      protocol: TCP

I think you're seeing an instance of this known issue: [Heartbeat] Hints based autodiscover ports settings · Issue #17771 · elastic/beats · GitHub. You can work around it by referencing ${data.port} somewhere in the host URL:

co.elastic.monitor/hosts: test.kube-system.svc.cluster.local/health?ignore=${data.port}

Hope it helps!
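Applied to the annotation layout shown earlier, the workaround might look like this (a sketch; service and namespace names are placeholders):

```yaml
metadata:
  annotations:
    co.elastic.monitor/type: http
    co.elastic.monitor/schedule: '@every 30s'
    # Referencing ${data.port} in an otherwise unused query parameter keeps
    # the URL valid while consuming the per-port variable, which (per issue
    # #17771) should stop Heartbeat from generating an extra monitor.
    co.elastic.monitor/hosts: <service-name>.<namespace>.svc.cluster.local/health?ignore=${data.port}
```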


@emilioalvap

Thank you, that solves the problem for us.
I tried it with one service and only the valid monitor was created.

To your question: no, we have not changed anything in the pod config since April 2022.

We have the Pod running on port 8081 and a Service forwarding from 80 to 8081.

I remember stumbling upon ${data.port} when initially setting this up last year, but we created Services for the Pods and had no problems with 7.17.x. I don't know why the upgrade to 8.3.3 triggers the bug now. Is there anything we can improve in our configuration/setup?

Hi @kintetsu,

I'm glad you were able to fix the situation. If what you're trying to monitor is the service endpoint, then it probably makes sense to create hints for the service itself rather than the pod. When a pod is set up with multiple ports, we don't really know whether the purpose is to test any or all of them, so by default we treat every exposed port on a pod as a potential monitor.
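For example, moving the hint annotations onto the Service might look like this (a sketch; names are placeholders, and it assumes the kubernetes provider is configured with resource: service as in the config above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: <service-name>
  namespace: <namespace>
  annotations:
    # Hints on the Service are matched by the autodiscover provider with
    # resource: service. The Service exposes a single port here, so the
    # per-port duplication seen on the pod should not apply.
    co.elastic.monitor/type: http
    co.elastic.monitor/hosts: <service-name>.<namespace>.svc.cluster.local/health
    co.elastic.monitor/schedule: '@every 30s'
```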

I was able to reproduce the issue on 7.17 as well, and this section of Heartbeat has not been modified in a while.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.