metricbeat continues polling the actuator endpoint but logs errors: unable to decode response from prometheus endpoint: error making http request: Get https://xxx.xxx.x.x:1443/actuator/prometheus: dial tcp xxx.xxx.x.x:1443: i/o timeout
then this: unable to decode response from prometheus endpoint: error making http request: Get https://xxx.xxx.x.x:1443/actuator/prometheus: dial tcp xxx.xxx.x.x:1443: connect: no route to host
After restarting the Docker container, it returns to logging valid results, but still at a higher frequency than configured.
I tried to reproduce it (on Kubernetes, though the logic behind it is exactly the same) and here is what I see:
After I stop the target pod/container there is one failure of the metricset, similar to what you see, and then the metricset stops. Here is what the logs print:
Then if I start the pod again I see that metrics are being reported every minute (the defined interval).
It seems that I cannot reproduce your issue. My questions: are you able to see something similar to what I mentioned above (logs about a stopped metricset)? And could your issue be one of timing? If the time between the stop and the start is short enough, maybe Metricbeat considers this pod as restarted and does not stop the metricset for it.
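The timing heuristic I'm speculating about could be sketched roughly like this. This is purely illustrative and is not Metricbeat's actual internals; the grace period value is an assumption:

```python
# Purely illustrative sketch, NOT Metricbeat's actual logic: if a container
# comes back within some grace period, treat it as a restart and keep the
# existing metricset instead of stopping it.
GRACE_SECONDS = 60  # assumed value, for illustration only


def treated_as_restart(stopped_at: float, started_at: float) -> bool:
    """Return True if the stop->start gap is short enough to look like a restart."""
    return (started_at - stopped_at) <= GRACE_SECONDS


print(treated_as_restart(100.0, 130.0))  # 30s gap  -> True (looks like a restart)
print(treated_as_restart(100.0, 300.0))  # 200s gap -> False (treated as a new container)
```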
I got your point here! However, I have some questions:
Is the "old" container being stopped, and is there a new one (with a new container ID) after the docker-compose restart? What happens if you try docker-compose with the --force-recreate flag?
If the container is being stopped, can you find a stop event for it in the Metricbeat logs? Or anything else interesting in the logs?
If I use --force-recreate the container ID changes, but I still have the same issue.
In the logs I do not see a stop event. I only see this during the recreation of the container (several times):
2020-05-13T12:22:30.286Z INFO module/wrapper.go:252 Error fetching data for metricset prometheus.collector: unable to decode response from prometheus endpoint: error making http request: Get https://xxx.xxx.x.x:1443/actuator/prometheus: dial tcp xxx.xxx.x.x:1443: connect: connection refused
2020-05-13T12:23:18.257Z WARN tlscommon/tls_config.go:79 SSL/TLS verifications disabled.
@MichaelM I think this has to do with how docker-compose actually handles this case and what signals are being sent. I was able to reproduce your case as follows:
docker-compose up
change something
run again docker-compose up
This will again create a start event, but for some reason there is no stop event. The result is two Metricset instances running, and hence metrics being collected twice.
A workaround I found is running docker-compose down && docker-compose up -d, since docker-compose down will send a stop event which makes the "old" module stop (you will need to wait a little for the cleanup, since it does not happen immediately).
I'm not sure if this is a problem on our end or an issue with Compose; I will need to investigate it more. Let me know what you think.
Great, your suggested workaround seems to work. Many thanks for that!
The good thing about calling just docker-compose up without down first was that the service was simply upgraded if needed, and could run without interruption if nothing changed between deployments.
About the Docker events you mentioned: I had a look at the Metricbeat logs for both cases (docker-compose with and without down). They looked pretty similar, with the following differences (highlighted):
docker-compose up only:
Got a new docker event: {kill [old id]...}
Got a new docker event: {die [old id]...}
Got a new docker event: {stop [old id]...}
Got a new docker event: {rename [old id]...}
Got a new docker event: {create [new id]...}
Got a new docker event: {start [new id]...}
docker-compose down && docker-compose up:
Got a new docker event: {kill [old id]...}
Got a new docker event: {die [old id]...}
Got a new docker event: {stop [old id]...}
Got a new docker event: {destroy [old id]...}
Got a new docker event: {start [new id]...}
So if Metricbeat were able to react to the kill, die, stop or rename events, that would probably solve the issue in the future.
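The event sequences above can be replayed with a small sketch (hypothetical, not Metricbeat's implementation) to show why reacting to a terminal event on the old container ID would leave only one active metricset:

```python
# Hypothetical sketch, NOT Metricbeat's implementation: track which container
# IDs have an active metricset and stop collection for an ID when a terminal
# Docker event arrives for it. The event names match the logs above.
TERMINAL = {"kill", "die", "stop", "destroy", "rename"}


def apply_event(active: set, event: str, container_id: str) -> None:
    """Update the set of container IDs with an active metricset for one event."""
    if event == "start":
        active.add(container_id)
    elif event in TERMINAL:
        active.discard(container_id)


# Replay the `docker-compose up`-only sequence from the logs above:
active = {"old-id"}
events = [("kill", "old-id"), ("die", "old-id"), ("stop", "old-id"),
          ("rename", "old-id"), ("create", "new-id"), ("start", "new-id")]
for ev, cid in events:
    apply_event(active, ev, cid)
print(active)  # only the new container keeps a metricset -> {'new-id'}
```

With this handling, the `rename` (or earlier `kill`/`die`/`stop`) event on the old ID would tear down the old metricset even though `docker-compose up` never emits `destroy`, so no double collection would occur.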