We are running Metricbeat in ECS, with the prometheus module enabled for Docker autodiscovery in the Metricbeat configuration. We are on Metricbeat version 7.6.2 on Amazon Linux 2. This is the relevant section of the Metricbeat configuration.
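Roughly, the autodiscover block has this shape (the label condition, metrics path, and period below are illustrative placeholders, not our exact values):

```yaml
metricbeat.autodiscover:
  providers:
    - type: docker
      templates:
        # placeholder condition: we match on a label our services set on their containers
        - condition:
            contains:
              docker.container.labels.prometheus_scrape: "true"
          config:
            - module: prometheus
              metricsets: ["collector"]
              # host IP and container port are filled in by autodiscover
              hosts: ["${data.host}:${data.port}"]
              # placeholder path standing in for <redacted_url_section>/prometheus
              metrics_path: /internal/prometheus
              period: 30s
```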
When containers recycle, Metricbeat keeps trying to collect metrics from the dead containers, i.e., from URLs that are no longer reachable.
Here's an excerpt from our logs:
unable to decode response from prometheus endpoint: error making http request: Get http://ip-10-x-x-x.us-west-2.compute.internal:yyyy/<redacted_url_section>/prometheus: dial tcp 10.X.X.X:YYYY: connect: connection refused
We are running Metricbeat as a daemon service in the ECS cluster. The IP in this case is the IP of the host that the Metricbeat daemon reporting these logs runs on. We run the application container in host network mode in ECS, and the port yyyy is the port the application container listens on. The reason we get the connection refused error is that the application container has recycled and been scheduled on a different host, but Metricbeat keeps trying to connect to the old host. It does autodiscover the new container as well; it just fails to understand that the old container is no longer there.
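For context, the application services are declared roughly like this (a CloudFormation-style YAML sketch; the name and port are placeholders, the port standing in for yyyy):

```yaml
# Illustrative sketch of an application task definition, not our exact template
AppTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: api-gateway
    NetworkMode: host              # container binds directly to a port on the host
    ContainerDefinitions:
      - Name: api-gateway
        Image: placeholder-image:latest
        PortMappings:
          - ContainerPort: 9090    # placeholder for the yyyy port Metricbeat scrapes
```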
Is there something missing in our config or is this a known bug? Thanks in advance for the help.
Unfortunately, moving away from host network mode isn't an option for us, because all our services are configured that way.
But I did some more investigating. The error happens only in a particular case. As part of a rolling deployment, ECS sometimes schedules a new container of a service on a host where another container of that service is already running. We have a distinctInstance placement constraint on the ECS scheduler, and it is respected for new containers, since they are placed on distinct hosts, but not with respect to the old containers during a rolling restart.
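The constraint sits on the ECS service, roughly like this (again a CloudFormation-style sketch; the deployment percentages are illustrative, not our exact values):

```yaml
# Illustrative sketch of the ECS service definition, not our exact template
AppService:
  Type: AWS::ECS::Service
  Properties:
    ServiceName: api-gateway
    TaskDefinition: !Ref AppTaskDefinition
    DesiredCount: 2
    PlacementConstraints:
      - Type: distinctInstance       # one task of the service per container instance
    DeploymentConfiguration:
      MinimumHealthyPercent: 100     # illustrative rolling-deployment settings
      MaximumPercent: 200
```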
Now, because the service runs in host mode, the new container starts up, hits a port clash, and dies quickly. And because the scheduler thinks it was able to schedule a new container, it also kills the other container on that host as part of the rolling deployment, and a new container is started on a new host. So, effectively, you see two containers of the same service die on the same host in very close succession, one with exit code 1 and the other with exit code 143. It's in this case that the prometheus collectors Metricbeat started for the two containers of that service do not get cleaned up. One collector lingers on and keeps attempting to scrape metrics on host:port.
I have run Metricbeat in debug mode and compared the Docker events around this time interval with the events received in Metricbeat. They line up fine; no Docker event is missed. Metricbeat gets all the events from Docker in the correct order and, as far as the debug logs go, seems to process them properly as well. It looks like the logic that stops the prometheus runners is running into a problem.
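For reference, this is roughly how I enabled debug logging for the comparison (the selector names are my best guess at the ones that cover autodiscover and the Docker watcher):

```yaml
# Additions to metricbeat.yml while debugging; selector names are my guess at the relevant ones
logging.level: debug
logging.selectors: ["autodiscover", "docker", "bus"]
logging.to_files: true
logging.files:
  path: /var/log/metricbeat
  name: metricbeat-debug.log
```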
I have provided a link to the Metricbeat debug logs from one host where we were seeing this error. There are two services with prometheus metrics enabled that got scheduled on this host: api-gateway and data-transform.
The sequence of events is like this:

- new Metricbeat containers come up on the host around 2020-07-29 17:18
- Metricbeat scans all containers on that host -> one container for api-gateway is already running
- a few rolling deployments are triggered in the cluster around 2020-07-29 17:40
- this causes new containers of data-transform and api-gateway, among other services, to be scheduled on this host a few minutes later
- around 2020-07-29 17:56 both containers of api-gateway die
- one prometheus runner for api-gateway lingers on and keeps attempting to scrape metrics from host:port