Metricbeat autodiscover on Kubernetes

I'm running Metricbeat 6.3.0 with autodiscover enabled, deployed as a DaemonSet on my K8s cluster in GCP as described in the documentation.

Autodiscover works fine once I attach the proper annotations to my pods. However, once I remove an annotation - or change its value from true to false - Metricbeat still tries to connect to the old pod address from before the annotation removal and pod recreation:

error making http request: Get https://10.48.10.74:9200/_nodes/_local: dial tcp 10.48.10.74:9200: getsockopt: no route to host

Here's the metricbeat configuration:

processors:
  - add_cloud_metadata: ~
  - add_kubernetes_metadata: ~
  - add_host_metadata: ~
  - add_docker_metadata:
      host: "unix:///var/run/docker.sock"

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      include_annotations: ["healthmon.kibana","healthmon.elasticsearch"]
      templates:
        - condition:
            contains:
               kubernetes.annotations.healthmon.kibana: "true"
          config:
            - module: kibana
              metricsets: ["status"]
              enabled: true
              period: 10s
              hosts: ["https://${data.host}:${data.port}"]
              username: "${ES_MODULE_USER}"
              password: "${ES_MODULE_PASSWORD}"
              ssl.verification_mode: none
        - condition:
            contains:
               kubernetes.annotations.healthmon.elasticsearch: "true"
          config:
            - module: elasticsearch
              metricsets: ["node", "node_stats"]
              enabled: true
              period: 10s
              hosts: ["https://${data.host}:9200"]
              username: "${ES_MODULE_USER}"
              password: "${ES_MODULE_PASSWORD}"
              ssl.verification_mode: none
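
For context, the annotated pods carry metadata roughly like this (just a sketch of the relevant part; the names are placeholders for our setup):

metadata:
  annotations:
    healthmon.elasticsearch: "true"   # matches the elasticsearch template condition above
    healthmon.kibana: "true"          # removing this (or setting it to false) is what triggers the problem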

It seems that Metricbeat recognizes the change, but does not update its configuration / disable the metricset:

2018-07-10T10:07:41.872Z        ERROR   kubernetes/watcher.go:195       kubernetes: Watching API error EOF
2018-07-10T10:07:41.872Z        INFO    kubernetes/watcher.go:179       kubernetes: Watching API for resource events
2018-07-10T10:07:41.877Z        INFO    autodiscover/autodiscover.go:181        Autodiscover stopping runner: elasticsearch [metricsets=2]
2018-07-10T10:07:41.881Z        WARN    [cfgwarn]       node/node.go:37 BETA: The elasticsearch node metricset is beta
2018-07-10T10:07:41.882Z        WARN    [cfgwarn]       node_stats/node_stats.go:37     BETA: The elasticsearch node_stats metricset is beta
2018-07-10T10:07:41.882Z        INFO    autodiscover/autodiscover.go:144        Autodiscover starting runner: elasticsearch [metricsets=2]
2018-07-10T10:07:41.882Z        WARN    transport/tls.go:227    SSL/TLS verifications disabled.
2018-07-10T10:07:41.882Z        WARN    transport/tls.go:227    SSL/TLS verifications disabled.
2018-07-10T10:07:46.028Z        INFO    [monitoring]    log/log.go:124  Non-zero metrics in the last 30s      

Any idea for the cause of this problem / way how to dig deeper into this?

Uhm, you may be hitting a bug/limitation here. Let me understand the process you are following:

  1. Launch a pod annotated to match some of your conditions
  2. Remove the annotations, without stopping/restarting the containers
  3. Autodiscover fails to stop the module

Is that what you are experiencing? I may need to give this a try.

Best regards

Almost correct.

I remove the annotation from the annotated containers - this results in a recreation of the pod / its containers. The only containers that are not restarted are the Metricbeat containers. By the way, restarting the Metricbeat containers fixes the issue - but this shouldn't be required, right?
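
By restarting I mean deleting the Metricbeat DaemonSet pods so the controller recreates them, roughly like this (namespace and label selector are placeholders for our setup):

kubectl -n kube-system delete pod -l k8s-app=metricbeat   # the DaemonSet schedules fresh pods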

Autodiscover keeps the module for ~1m after the container is gone. Maybe this is what you are seeing? After that time it will clean up and stop the module.

Unfortunately it runs forever until the Metricbeat DaemonSet is restarted :confused:. It also still reports the annotation with the value true within the monitoring event / the Elasticsearch document.

{
  "_index": "*****-6.3.0-2018.07.10",
  "_type": "doc",
  "_id": "Ov7lhWQBJCF7wZl_GKbv",
  "_score": 1,
  "_source": {
    "@timestamp": "2018-07-10T20:32:17.223Z",
    "metricset": {
      "name": "node",
      "module": "elasticsearch",
      "host": "*****:9200",
      "rtt": 3005901
    },
    "error": {
      "message": "error making http request: Get https://*****:9200/_nodes/_local: dial tcp *****:9200: getsockopt: no route to host"
    },
    "kubernetes": {
      "namespace": "database",
      "labels": {
        "app": "es-coord",
        "pod-template-hash": "3631151355"
      },
      "annotations": {
        "healthmon": {
          "elasticsearch": "true"
        }
      },
      "container": {
        "name": "es-c"
      },
      "pod": {
        "name": "es-coord-7b75595799-rdhdw"
      },
      "node": {
        "name": "gke-*****-*****-small-92458ff4-n642"
      }
    },
    "beat": {
      "version": "6.3.0",
      "name": "healthmon-w7qzl",
      "hostname": "healthmon-w7qzl"
    },
    "host": {
      "containerized": false,
      "name": "healthmon-w7qzl",
      "architecture": "x86_64",
      "os": {
        "platform": "centos",
        "version": "7 (Core)",
        "family": "redhat",
        "codename": "Core"
      }
    },
    "meta": {
      "cloud": {
        "instance_name": "gke-*****-*****-small-92458ff4-n642",
        "machine_type": "projects/*****/machineTypes/custom-6-20480",
        "availability_zone": "projects/*****/zones/europe-west3-a",
        "project_id": "<<>>>",
        "provider": "gce",
        "instance_id": "4070878436135109641"
      }
    }
  },
  "fields": {
    "@timestamp": [
      "2018-07-10T20:32:17.223Z"
    ]
  }
}

Could you let me know which commands / API calls would let me dig a bit deeper into the Kubernetes side of those events? Perhaps there's something stuck in k8s that leads to this behaviour.

Running Metricbeat with these debug flags should give you a lot of info about what's going on:

-e -v -d autodiscover,kubernetes,bus
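
If Metricbeat runs as a DaemonSet, one way to pass these is via the container args, for example (the image tag and config path here are just a sketch for your manifest):

containers:
  - name: metricbeat
    image: docker.elastic.co/beats/metricbeat:6.3.0
    args:
      - "-c"
      - "/etc/metricbeat.yml"              # placeholder config path
      - "-e"
      - "-v"
      - "-d"
      - "autodiscover,kubernetes,bus"      # enable the autodiscover/kubernetes/bus debug selectors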

Please report any findings (you may need to remove sensitive data from the events before posting here).
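
For the Kubernetes side of your question, checking what the API currently reports for the pod can also help (pod name and namespace are placeholders):

kubectl -n <namespace> get pod <pod-name> -o jsonpath='{.metadata.annotations}'   # annotations as the API sees them
kubectl -n <namespace> describe pod <pod-name>                                    # includes recent events for the pod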

Best regards

Thanks for the suggestions.

Here's the Metricbeat debug log, starting from the time when I set the annotation healthmon.elasticsearch to false.

From 2018-07-12T12:40:51.405Z on, Metricbeat just repeatedly reports back its internal monitoring information - so nothing new regarding autodiscover.

Do you see anything strange going on here? Could the sidecar containers perhaps be causing issues here?

UPDATE: The sidecar is not the issue, since another pod without sidecars shows the same behaviour.

Any further ideas on this?

Hi @mat1010,

Sorry for the late response. I'm wondering if the sidecar containers are the issue here. With the given settings you would launch a module for each container in the pod, including initContainers. Maybe that's the issue?

Like I stated in my previous post:

But that one might be a reason:

I will re-check this and report back.

Hi @exekias, after updating the whole ELK stack the issue is gone for pods that DON'T have any init containers. For the pods WITH init containers the issue is still the same.

I was now also able to prove that it's the init container. All error messages originate from requests to the init container. Do you know if there's a way to attach an annotation only to a specific container rather than to the whole pod? I haven't found anything yet.

Do you think it's worth creating a bug report for this? Or is this the intended behaviour?

You can ensure you launch the config for the right container by adding a condition that checks for it. For instance:

    - type: kubernetes
      include_annotations: ["healthmon.kibana","healthmon.elasticsearch"]
      templates:
        - condition:
            contains:
              kubernetes.annotations.healthmon.kibana: "true"
              kubernetes.container.name: "yourdesiredcontainer"
          config:
            - module: kibana
              ...

Let me know if that works for you!

We are hitting this bug here as well. The autodiscover provider doesn't remove deleted pods from its list and tries to poll them forever. It triggers errors of this type:
error making http request: Get http://10.1.27.81:8089/server-status?auto=: dial tcp 10.1.27.81:8089: connect: no route to host
This issue appeared in 6.3 and is still present in 6.4. I did a full test case with the debug log. If anybody can help me fix this, I can send the debug log via private message.

Thanks for your suggestion. Unfortunately this makes Metricbeat less flexible, because we would now either have to extend our Metricbeat configuration with the relevant container names whenever we add something new, or always give the container the same name for a given function. But that's not a huge drawback.

Now the annotation check at least targets the correct container, but the issue is still occurring.

After further tests and investigation, it seems the cause is not the same as mat1010's.
Here it happens only on pods which don't terminate immediately, i.e. with terminationGracePeriodSeconds <> 0. When deleting a pod with active network connections, we see a start event after the stop event on the same endpoint. If I set terminationGracePeriodSeconds to 0, Metricbeat removes the pod correctly.
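
For reference, the workaround that makes it disappear is just this in the pod spec (rest of the spec omitted):

spec:
  terminationGracePeriodSeconds: 0   # skip the grace period so the pod is removed immediately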

Here are the debug logs when the bug happens:
2018-09-14T13:49:05.320Z        DEBUG   [kubernetes]    kubernetes/kubernetes.go:111    Watcher Pod update: metadata:<name:"dc-u3-237-6qcq8" generateName:"dc-u3-237-"
2018-09-14T13:49:05.320Z        DEBUG   [bus]   bus/bus.go:72   metricbeat: map[host:10.1.27.81 port:8080 kubernetes:{"annotations":{"kubernetes":{"io/created-by":"{\
2018-09-14T13:49:05.320Z        DEBUG   [bus]   bus/bus.go:72   metricbeat: map[config:[0xc4209af8f0] stop:true host:10.1.27.81 port:8089 kubernetes:{"annotations":{"
2018-09-14T13:49:05.321Z        DEBUG   [autodiscover]  autodiscover/autodiscover.go:206        Got a stop event: map[stop:true host:10.1.27.81 port:8080 kubernetes:{
2018-09-14T13:49:05.321Z        DEBUG   [autodiscover]  autodiscover/autodiscover.go:206        Got a stop event: map[stop:true host:10.1.27.81 port:8089 kubernetes:{
2018-09-14T13:49:05.321Z        DEBUG   [bus]   bus/bus.go:72   metricbeat: map[start:true host:10.1.27.81 port:8080 kubernetes:{"annotations":{"kubernetes":{"io/crea
2018-09-14T13:49:05.322Z        DEBUG   [bus]   bus/bus.go:72   metricbeat: map[port:8089 kubernetes:{"annotations":{"kubernetes":{"io/created-by":"{\"kind\":\"Serial
2018-09-14T13:49:05.322Z        DEBUG   [autodiscover]  autodiscover/autodiscover.go:164        Got a start event: map[start:true host:10.1.27.81 port:8080 kubernetes
2018-09-14T13:49:05.322Z        DEBUG   [autodiscover]  autodiscover/autodiscover.go:164        Got a start event: map[start:true host:10.1.27.81 port:8089 kubernetes

Here are the steps to reproduce the issue:

  1. Start a pod which ignores SIGTERM:
    oc run mybash --image=rhel7 -l metricbeat=collect -- /usr/bin/bash -c "trap 'echo Got TERM' TERM ; while : ; do date ; sleep 1 ; done"

  2. Configure the autodiscover provider to watch pods with the label metricbeat=collect (see the sketch after this list).

  3. Delete the pod : oc delete pod mybash ...

  4. Wait a few seconds; the pod terminates after 30s (default terminationGracePeriodSeconds=30) and a new one is created.
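
For step 2, the autodiscover provider looks roughly like this (the apache module and port 8089 are taken from the error above; any module works to reproduce the behaviour):

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.labels.metricbeat: "collect"
          config:
            - module: apache
              metricsets: ["status"]
              period: 10s
              hosts: ["http://${data.host}:8089"]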

Expected result:
Autodiscover stops watching the deleted pod.

Actual result:
Autodiscover keeps watching the deleted pod forever. The pod no longer exists, and Metricbeat triggers an error at each polling:
connect: no route to host

May I open a new issue on GitHub?


I did some testing but couldn't reproduce this. Something that may be misleading is that we keep the module enabled for a while (exactly 60s) after it's deleted. This is particularly useful for logging use cases.

Could you please check if the module is still reporting 1 minute after the pod stopped? By the way, this can be configured with cleanup_timeout: 0s passed as part of the autodiscover provider settings:

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      ...
      cleanup_timeout: 0s

We upgraded to 6.4.1 and tried setting cleanup_timeout to 0s. Unfortunately, we still have the issue. We have an open case with Elastic support (#00260837), but so far it hasn't helped.

Important information:
To trigger the bug, the autodiscover provider must have already watched the pod once. Please test again with only 2 nodes scheduled and try to delete the bash pod twice on each node.
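
For example, something like this on each node (pod names are placeholders; the controller created by oc run recreates the pod after each deletion):

oc get pods -l metricbeat=collect -o wide        # note which node each pod runs on
oc delete pod <first-pod-on-node>                # first deletion; wait ~40s for the replacement
oc delete pod <replacement-pod-on-same-node>     # second deletion on the same node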

I confirm that the bug appeared in Beats v6.3.