Metricbeat scraping dead containers for Prometheus metrics

We are running metricbeat in ECS, with the prometheus module enabled for Docker autodiscovery in the metricbeat configuration. We are on metricbeat version 7.6.2 on Amazon Linux 2. This is the relevant section of the metricbeat configuration:

  - type: docker
    labels.dedot: false
    templates:
      - condition.equals:
          docker.container.labels.prometheus.scrape: "true"
        config:
          - module: prometheus
            metricsets: ["collector"]
            enabled: true
            period: 30s
            hosts: ["${data.host}:${data.port:8080}"]
            metrics_path: "${data.docker.container.labels.prometheus.url:/<some_prometheus_endpoint>}"
            namespace: "metrics"

When the containers recycle, metricbeat keeps trying to collect metrics from the dead containers, i.e., from unreachable URLs.

Here's an excerpt from our logs:

unable to decode response from prometheus endpoint: error making http request: Get<redacted_url_section>/prometheus: dial tcp 10.X.X.X:YYYY: connect: connection refused

We are running metricbeat as a daemon in the ECS cluster. The IP in this case is the IP of the same host that the metricbeat daemon reporting the logs runs on. We run the application container in host network mode in ECS, and the port YYYY is the port the application container listens on. The reason we get the HTTP connection-refused error is that the application container has recycled and been scheduled on a different host, but metricbeat keeps trying to connect to the old host. It does, however, autodiscover the new container too; it just fails to understand that the old container is no longer there.

Is there something missing in our config or is this a known bug? Thanks in advance for the help.

Could you please try running the application without host network mode and see whether metricbeat reacts to it?

You can try running with debug mode on to see whether metricbeat receives such events.
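For reference, one way to see those events is to run metricbeat in the foreground with debug logging limited to the relevant selectors (standard metricbeat CLI flags; the selector names and config path below are just the ones that seem useful here, adjust for your setup):

```shell
# Run metricbeat in the foreground (-e logs to stderr) with debug
# output (-d) restricted to the autodiscover and docker selectors,
# so container start/stop events show up in the log stream.
metricbeat -e -d "autodiscover,docker" -c /etc/metricbeat/metricbeat.yml
```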

Unfortunately, not running in host mode isn't an option for us because all our services are configured that way.

But we did some more investigating. The error happens only in a particular case. As part of its rolling deployment, ECS sometimes schedules a new container of a service on a host where another container of that service is already running. We have a placement strategy of distinctInstances on the ECS scheduler, and that is respected for new containers, since they are placed on distinct hosts, but not against the old containers during a rolling restart.

Now, because the service runs in host mode, the new container starts up, encounters a port clash, and dies quickly. And because the scheduler thinks it was able to schedule a new container, it also kills the other container on that host as part of the rolling deployment, and a new container is started on a new host. So effectively you will see two containers of the same service die on the same host in very close succession, one with exit code 1 and the other with exit code 143. It's in this case that the prometheus collectors started for the two containers of the same service do not get cleaned up in metricbeat: one collector lingers on and keeps attempting to scrape metrics on host:port.
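To reason about where the cleanup could go wrong, here is a toy model of the kind of bookkeeping involved. This is not metricbeat's actual implementation; the `RunnerRegistry` class, the keying of runners by a hash of the generated config, and the refcount behaviour are all assumptions for illustration. The point is that two containers of one service in host mode generate identical prometheus configs, so a lingering runner suggests one of the stop events failed to drain the shared bookkeeping:

```python
# Toy model (NOT metricbeat's real code): runners deduplicated by a
# hash of the generated module config, with a reference count per hash.

class RunnerRegistry:
    def __init__(self):
        self.refcount = {}    # config hash -> number of live containers
        self.running = set()  # config hashes with an active collector

    def start(self, config):
        key = hash(frozenset(config.items()))
        self.refcount[key] = self.refcount.get(key, 0) + 1
        if key not in self.running:
            self.running.add(key)  # launch the prometheus collector

    def stop(self, config):
        key = hash(frozenset(config.items()))
        if key not in self.refcount:
            return  # stop for a config we never saw: silently ignored
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            self.running.discard(key)  # tear down the collector


registry = RunnerRegistry()
cfg = {"module": "prometheus", "hosts": "10.0.0.1:8080"}

# Two api-gateway containers on the same host -> identical configs.
registry.start(cfg)  # old container
registry.start(cfg)  # new container (port clash, exits with code 1)

# Both die in close succession during the rolling deployment.
registry.stop(cfg)
registry.stop(cfg)
print(len(registry.running))  # 0 -> cleanup works when the hashes match

# But if a stop event resolves to a *different* config (say, the dead
# container's metadata is already gone when the event is handled), the
# refcount for the original hash never drains and a runner lingers:
registry.start(cfg)
registry.stop({"module": "prometheus", "hosts": "unknown:8080"})
print(len(registry.running))  # 1 -> a collector keeps scraping host:port
```

Under this model, the docker events themselves can all arrive correctly and in order, and a runner can still leak if the config generated at stop time doesn't match the one generated at start time.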

I have run metricbeat in debug mode and compared the Docker events around this time interval with the events received by metricbeat. They line up fine; no Docker event is missed. Metricbeat gets all the events from Docker in the correct order and, as far as the debug logs go, seems to process them properly as well. It looks like the logic that stops the runners for the prometheus module is encountering some problem.
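For anyone wanting to reproduce the comparison, the container side can be pulled straight from the Docker daemon (standard `docker events` flags; the time window is the one from this incident):

```shell
# Replay container lifecycle events for the incident window, keeping
# only starts and deaths along with the container name and exit code.
docker events \
  --since '2020-07-29T17:15:00' \
  --until '2020-07-29T18:00:00' \
  --filter 'event=start' \
  --filter 'event=die' \
  --format '{{.Time}} {{.Actor.Attributes.name}} {{.Status}} exit={{.Actor.Attributes.exitCode}}'
```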

I have provided a link to the metricbeat debug logs from one host where we were seeing this error. There are two services with prometheus metrics enabled that got scheduled on this host: api-gateway and data-transform.

The sequence of events was like this:

  • new metricbeat containers come up on host around 2020-07-29 17:18
  • it scans all containers on that host -> one container for api-gateway is already running
  • a few rolling deployments are triggered in the cluster around 2020-07-29 17:40
  • this causes a new container of data-transform and of api-gateway, among other services, to be scheduled on this host a few minutes later
  • around 2020-07-29 17:56 both containers of api-gateway die
  • one prometheus runner for api-gateway lingers on and keeps attempting to scrape metrics from host:

I hope this helps in following the logs. Here's the link to the logs -