I'm running metricbeat 6.3.0 with autodiscover enabled, deployed as a daemonset on my K8s cluster in GCP as described in the documentation.
The autodiscovery works fine once I attach the proper annotations to my pods. However, once I remove an annotation - or set its value from true to false - metricbeat still seems to try to connect to the old pod address from before the annotation removal and pod recreation:
error making http request: Get https://10.48.10.74:9200/_nodes/_local: dial tcp 10.48.10.74:9200: getsockopt: no route to host
I remove the annotation from the annotated containers - this results in a recreation of the pod / its containers. The only containers that are not being restarted are the metricbeat containers. Btw. restarting the metricbeat containers fixes the issue - but this shouldn't be required, right?
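For reference, the relevant part of my metricbeat autodiscover config looks roughly like this (a simplified sketch - the condition on the healthmon.elasticsearch annotation and the elasticsearch module settings are illustrative, not my exact config):

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      host: ${NODE_NAME}
      templates:
        # only poll pods that carry the healthmon.elasticsearch=true annotation
        - condition:
            equals:
              kubernetes.annotations.healthmon.elasticsearch: "true"
          config:
            - module: elasticsearch
              metricsets: ["node", "node_stats"]
              hosts: ["https://${data.host}:9200"]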
Autodiscover keeps the module for ~1m after the container is gone. Maybe this is what you are seeing? After that time it will clean up the module and stop it.
Unfortunately it runs forever until the metricbeat daemonset is restarted. It also reports the annotation back with the value true within the monitoring event / the elasticsearch document.
Could you let me know which commands / API calls would let me dig a bit deeper into the kubernetes side for those events? Maybe there's something stuck in k8s that leads to this behaviour.
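To be concrete, I mean checks along these lines (standard kubectl calls; namespace and pod name are placeholders):

# recent lifecycle events in the namespace
kubectl -n <namespace> get events --sort-by=.lastTimestamp
# current annotations and container states of the pod
kubectl -n <namespace> describe pod <pod-name>
# raw pod manifest, to see exactly which annotations the watcher picks up
kubectl -n <namespace> get pod <pod-name> -o yaml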
Here's the metricbeat debug log, starting from the time when I set the annotation healthmon.elasticsearch to false.
From 2018-07-12T12:40:51.405Z on, metricbeat just repeatedly reports back the internal monitoring information - so nothing new regarding autodiscover.
Do you see anything strange going on here? Could the sidecar containers be causing issues here?
UPDATE: The sidecar is not an issue since another pod without sidecars has the same issue.
Sorry for the late response. I'm wondering if the sidecar containers are the issue here. With the given settings, you would launch a module for each container in the pod, including initContainers. Maybe that's the issue?
Hi @exekias, after updating the whole ELK stack the issue is gone for pods that DON'T have any init containers. For the pods WITH init containers the issue is still the same.
I was now also able to prove that it's the init container. All error messages originate from requests to the init container. Do you know if there's a way to attach an annotation only to a specific container rather than to the whole pod? Haven't found anything yet.
Here, we encounter this bug. The autodiscover module doesn't remove deleted pods from its harvest list and tries to poll them forever. It triggers errors of this type: error making http request: Get http://10.1.27.81:8089/server-status?auto=: dial tcp 10.1.27.81:8089: connect: no route to host
This issue appeared in 6.3 and is still there in 6.4. I did a full test case with the debug log. If anybody can help me fix that, I can send the debug log via private message.
Thanks for your suggestion. Unfortunately this makes metricbeat less flexible, because we would now either have to extend our metricbeat configuration with the relevant container names whenever we add something new, or we would always have to name the container the same for a specific function. But this is not that huge of a drawback.
Now the annotation check at least targets the correct container. But the issue is still occurring.
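For reference, the template condition now looks roughly like this (a sketch - the annotation key, container name and module settings are simplified from the real config):

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      host: ${NODE_NAME}
      templates:
        # match on the container name as well, so the init container is skipped
        - condition:
            and:
              - equals:
                  kubernetes.annotations.healthmon.elasticsearch: "true"
              - equals:
                  kubernetes.container.name: "elasticsearch"
          config:
            - module: elasticsearch
              metricsets: ["node", "node_stats"]
              hosts: ["https://${data.host}:9200"]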
After further tests and investigations, it seems the cause is not the same as mat1010's.
Here, it happens only on pods which don't terminate immediately, i.e. with terminationGracePeriodSeconds <> 0. When deleting a pod with active network connections, we see a start event after the stop event on the same endpoint. If I set terminationGracePeriodSeconds to 0, metricbeat deletes the pod correctly.
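For completeness, this is where that field sits in the pod spec (a minimal sketch mirroring the oc run command below; the repro itself uses the default of 30):

apiVersion: v1
kind: Pod
metadata:
  name: mybash
  labels:
    metricbeat: collect
spec:
  terminationGracePeriodSeconds: 0   # 0 = kill immediately; with this, metricbeat cleans up the pod correctly
  containers:
    - name: mybash
      image: rhel7
      command: ["/usr/bin/bash", "-c", "trap 'echo Got TERM' TERM ; while : ; do date ; sleep 1 ; done"]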
Start a pod which ignores SIGTERM:
oc run mybash --image=rhel7 -l metricbeat=collect -- /usr/bin/bash -c "trap 'echo Got TERM' TERM ; while : ; do date ; sleep 1 ; done"
Configure the autodiscover module to watch pods with the label metricbeat=collect (a config sketch follows after the results below).
Delete the pod: oc delete pod mybash ...
Wait a few seconds; the pod terminates after 30s (the default is terminationGracePeriodSeconds=30) and a new one is created.
Expected result:
Autodiscover stops watching the deleted pod
Actual result:
Autodiscover keeps watching the deleted pod forever. The pod no longer exists and metricbeat triggers an error at each polling:
connect: no route to host
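For reference, the autodiscover block used for this test looks roughly like this (a sketch - the apache module and port 8089 match the error shown earlier, everything else is simplified):

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      host: ${NODE_NAME}
      templates:
        # start an apache module for every pod labelled metricbeat=collect
        - condition:
            equals:
              kubernetes.labels.metricbeat: "collect"
          config:
            - module: apache
              metricsets: ["status"]
              hosts: ["http://${data.host}:8089"]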
I did some testing but couldn't reproduce this. Something that may be misleading is that we keep the module enabled for a while (exactly 60s) after it's deleted. This is particularly useful for logging use cases.
Could you please check if the module is still reporting 1 minute after the pod stopped? By the way, this can be configured with cleanup_timeout: 0s passed as part of the autodiscover provider settings:
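A minimal sketch of where the option goes (the rest of the provider settings are placeholders for your existing ones):

metricbeat.autodiscover:
  providers:
    - type: kubernetes
      cleanup_timeout: 0s
      templates:
        # ... your existing templates ...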
We upgraded to 6.4.1 and tried setting cleanup_timeout to 0s. Unfortunately, we still have the issue. We have an open case with elastic support (#00260837), but so far it hasn't helped.
Important information:
To trigger the bug, the autodiscover module must have already watched the pod once. Please test again with only 2 nodes scheduled and try to delete the bash pod twice on each node.