metricbeat continues polling the actuator endpoint but logs errors: unable to decode response from prometheus endpoint: error making http request: Get https://xxx.xxx.x.x:1443/actuator/prometheus: dial tcp xxx.xxx.x.x:1443: i/o timeout
then this: unable to decode response from prometheus endpoint: error making http request: Get https://xxx.xxx.x.x:1443/actuator/prometheus: dial tcp xxx.xxx.x.x:1443: connect: no route to host
After restarting the Docker container, it returns to logging valid results, but still at a higher frequency than configured.
I tried to reproduce it (on Kubernetes, though the logic behind it is exactly the same) and here is what I see:
After I stop the target pod/container there is one failure of the metricset, similar to what you see, and then the metricset stops. Here is what the logs print:
Then if I start the pod again I see that metrics are being reported every minute (the defined interval).
It seems that I cannot reproduce your issue. My questions: are you able to see something similar to what I mentioned above (logs about a stopped metricset)? And could your issue be one of timing? If the time between the stop and the start is short enough, maybe Metricbeat considers this pod as restarted and does not stop the metricset for it.
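The timing heuristic I'm speculating about could be sketched roughly like this. This is purely illustrative and is not Metricbeat's actual internals; the grace period value is an assumption:

```python
# Purely illustrative sketch, NOT Metricbeat's actual logic: if a container
# comes back within some grace period, treat it as a restart and keep the
# existing metricset instead of stopping it.
GRACE_SECONDS = 60  # assumed value, for illustration only


def treated_as_restart(stopped_at: float, started_at: float) -> bool:
    """Return True if the stop->start gap is short enough to look like a restart."""
    return (started_at - stopped_at) <= GRACE_SECONDS


print(treated_as_restart(100.0, 130.0))  # 30s gap  -> True (looks like a restart)
print(treated_as_restart(100.0, 300.0))  # 200s gap -> False (treated as a new container)
```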
I got your point here! However, I have some questions:
Is the "old" container being stopped, and is there a new one (with a new container ID) after the docker-compose restart? What happens if you try docker-compose with the --force-recreate flag?
If the container is being stopped, can you find a stop event for it in the Metricbeat logs? Or anything else interesting in the logs?
If I use --force-recreate the container ID changes, but I still have the same issue.
In the logs I do not see a stop event. I only see this during the recreation of the container (several times):
2020-05-13T12:22:30.286Z INFO module/wrapper.go:252 Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get https://xxx.xxx.x.x:1443/actuator/prometheus: dial tcp xxx.xxx.x.x:1443: connect: connection refused
2020-05-13T12:23:18.257Z WARN tlscommon/tls_config.go:79 SSL/TLS verifications disabled.
@MichaelM I think this has to do with how docker-compose actually handles this case and what signals are being sent. I was able to reproduce your case as follows:
docker-compose up
change something
run again docker-compose up
This will again create a start event, but for some reason there is no stop event. The result is two Metricset instances running, and hence metrics being collected twice.
A workaround I found is running docker-compose down && docker-compose up -d, since docker-compose down will send a stop event which makes the "old" module stop (you will need to wait a little for the cleanup, since it does not happen immediately).
I'm not sure if this is a problem on our end or an issue with Compose; I will need to investigate it more. Let me know what you think.
Great, your suggested workaround seems to work. Many thanks for that!
The good thing about calling just docker-compose up without down first was that the service was simply upgraded if needed, and could run without interruption if nothing changed between deployments.
About the Docker events you mentioned: I had a look at the Metricbeat logs for both cases (docker-compose with and without down). They looked pretty similar, with the following differences (highlighted):
docker-compose up only:
Got a new docker event: {kill [old id]...}
Got a new docker event: {die [old id]...}
Got a new docker event: {stop [old id]...}
Got a new docker event: {rename [old id]...}
Got a new docker event: {create [new id]...}
Got a new docker event: {start [new id]...}
docker-compose down && docker-compose up:
Got a new docker event: {kill [old id]...}
Got a new docker event: {die [old id]...}
Got a new docker event: {stop [old id]...}
Got a new docker event: {destroy [old id]...}
Got a new docker event: {start [new id]...}
So if Metricbeat were able to react to the kill, die, stop or rename events, that would probably solve the issue in the future.
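The event sequences above can be replayed with a small sketch (hypothetical, not Metricbeat's implementation) to show why reacting to a terminal event on the old container ID would leave only one active metricset:

```python
# Hypothetical sketch, NOT Metricbeat's implementation: track which container
# IDs have an active metricset and stop collection for an ID when a terminal
# Docker event arrives for it. The event names match the logs above.
TERMINAL = {"kill", "die", "stop", "destroy", "rename"}


def apply_event(active: set, event: str, container_id: str) -> None:
    """Update the set of container IDs with an active metricset for one event."""
    if event == "start":
        active.add(container_id)
    elif event in TERMINAL:
        active.discard(container_id)


# Replay the `docker-compose up`-only sequence from the logs above:
active = {"old-id"}
events = [("kill", "old-id"), ("die", "old-id"), ("stop", "old-id"),
          ("rename", "old-id"), ("create", "new-id"), ("start", "new-id")]
for ev, cid in events:
    apply_event(active, ev, cid)
print(active)  # only the new container keeps a metricset -> {'new-id'}
```

With this handling, the `rename` (or earlier `kill`/`die`/`stop`) event on the old ID would tear down the old metricset even though `docker-compose up` never emits `destroy`, so no double collection would occur.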