Are you using autodiscover? Could you share your configuration?
Do you see anything suspicious in Metricbeat logs?
What version of Metricbeat are you using? In Metricbeat 7.9 there were some fixes for a similar issue.
I am using autodiscovery. I was on metricbeat 7.6, but after reading about the errors I tried 7.8 and am now on metricbeat:7.9-SNAPSHOT. It didn't work with any of these versions though.
I checked the metricbeat pod for errors and there was nothing suspicious. I was also not able to find a pattern in which pods are found and which are not.
My configs are as follows (I removed some company-internal things):
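For context, here is a rough reconstruction of the kind of setup in question (the condition, annotation names, and port variable are assumptions based on what is discussed later in this thread, not the actual redacted config):

```yaml
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        # Assumed opt-in condition via the prometheus.io/scrape annotation
        - condition:
            equals:
              kubernetes.annotations.prometheus.io/scrape: "true"
          config:
            - module: prometheus
              # Port taken from a pod annotation (variable form as originally written)
              hosts: ["${data.host}:${data.kubernetes.annotations.prometheus.io.port}"]
              metrics_path: /metrics
```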
Second, disable the cleanup grace period. It makes autodiscover wait some seconds after a pod is deleted before it stops monitoring it. This is useful with Filebeat to ensure that the final log lines are collected, but not so much with Metricbeat, which can keep requesting metrics from unavailable pods.
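A minimal sketch of how this looks on the kubernetes provider, assuming cleanup_timeout: 0 is what disables the grace period here:

```yaml
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      # Stop runners as soon as the pod is gone instead of waiting the default grace period
      cleanup_timeout: 0
```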
If the issue persists, add autodiscover debug logging and check in the logs whether you see events for the pods that you are missing. You can enable autodiscover debug logging with -d autodiscover. In your config:
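A sketch using the standard Beats logging options:

```yaml
# Same effect as passing -d "autodiscover" on the command line
logging.level: debug
logging.selectors: ["autodiscover"]
```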
I implemented the changes, but our staging cluster is experiencing a lot of deploys at the moment, so I'll verify this on the weekend. Thanks @jsoriano for your help so far.
@jsoriano Sorry, I only got around to getting back to this today (I am a student working part time). Anyway, the error persists even with the changes.
Here is a pod I monitored that has been running for 2 days now, showing the total metric count for the pod (past 15 min, with 2 restarts of metricbeat).
The logs are not really helpful. I filtered for errors and the only ones I got were multiple occurrences of
2020-09-03T06:55:41.214Z ERROR [autodiscover] autodiscover/autodiscover.go:209 Auto discover config check failed for config '', won't start runner: string value is not set accessing 'module'
and
2020-09-03T06:55:41.215Z ERROR [autodiscover] autodiscover/autodiscover.go:209 Auto discover config check failed for config '{
But metricbeat always finds some services, and as in the example above, restarting the metricbeat pods can make it find a pod's metrics (or lose them) without any change to the actual pod.
I can provide the full logs or other info if you want.
Could there be anything I am missing? (Btw running 7.9.0 now.)
This error is interesting: it indicates that autodiscover is generating some configurations with an incorrect module, but module is clearly set to prometheus in your configuration. Can you also try specifying the collector metricset? It would be like this:
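A sketch of the template config with the metricset added (the hosts variable mirrors the form discussed below and is an assumption about the original config):

```yaml
- module: prometheus
  metricsets: ["collector"]
  hosts: ["${data.host}:${data.kubernetes.annotations.prometheus.io.port}"]
  metrics_path: /metrics
```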
This would show the configuration that it is generating; there we could see why it says that the module value is not set. What does the complete config logged by this error look like?
I thought some pods just did not specify the correct metric port, that's why I didn't deem it interesting. There was an issue online where they said metricbeat should fail if a port is declared as the metric port but not as a pod port. The errors seem to occur during discovery, so for individual pods and not during the init of metricbeat. What still confuses me is that after restarting metricbeat a couple of times it eventually finds a pod (and reports the correct metrics from it), just not consistently.
Ok, I think that the problem is that the variable ${data.kubernetes.annotations.prometheus.io.port} cannot be resolved. We are introducing a change to improve feedback on these problems: https://github.com/elastic/beats/pull/20898
I think that the annotation is prometheus.io/port, and Beats "dedots" annotations by default, by replacing dots with underscores. The slash should be fine. I think that you would need to use this variable like this:
${data.kubernetes.annotations.prometheus_io/port}
In the debug logs you can probably see some autodiscover events, check there how this annotation is being processed.
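For illustration, such an event could carry the annotation under a dedotted key roughly like this (the values are made up; the point is the underscores in the prefix):

```yaml
kubernetes:
  annotations:
    prometheus_io/scrape: "true"
    prometheus_io/port: "9102"
```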
So far I cannot find anything explaining the randomness. Testing a lot of configs in our staging cluster is hard, as it would require the devs to change the pod annotations each time as well.
Testing with my own pods, I can say that the URI is always found if it matches (i.e. the config and the real annotation are the same).
What I can say is that using prometheus_io in the pod annotation is not an option, as it is an invalid annotation declaration:
The DaemonSet "test-metricbeat-app-longterm" is invalid:
spec.template.annotations: Invalid value: "prometheus_io/scrape_longterm":
prefix part a DNS-1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character
(e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
@jsoriano Could it be that not using the co.elastic.metrics/module: prometheus annotation is at fault here? I do set the module later in the configs, but it could be some sort of race condition where the check expects the module before the processing is done, and because of goroutines (or similar) this is not the case.
2020-09-24T08:17:24.277Z DEBUG [autodiscover] autodiscover/autodiscover.go:195 Generated config: {
"metrics_path": "/metrics"
}
2020-09-24T08:17:24.277Z DEBUG [autodiscover] autodiscover/autodiscover.go:259 Got a meta field in the event
2020-09-24T08:17:24.277Z ERROR [autodiscover] autodiscover/autodiscover.go:209 Auto discover config check failed for config '{
"metrics_path": "/metrics"
}', won't start runner: string value is not set accessing 'module'
I realized that metricbeat does not reliably fall back to a default period; sometimes it fails to do so.
What I mean is this:
Specifying a period in the main config works, and in the processor config you can specify a new one. If the pod then does not specify a period, metricbeat sometimes uses the default, but sometimes it just fails.
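To make this concrete, a sketch of the pattern being described (the annotation name for the period and the other values are assumptions):

```yaml
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.annotations.prometheus_io/scrape: "true"
          config:
            - module: prometheus
              metricsets: ["collector"]
              hosts: ["${data.host}:${data.kubernetes.annotations.prometheus_io/port}"]
              # Per-pod period taken from an annotation (name assumed); the expectation
              # is that pods without this annotation fall back to the default period,
              # but in practice the generated config sometimes fails instead.
              period: "${data.kubernetes.annotations.prometheus_io/period}"
```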
This is weird behavior and should be investigated further. Also, it would be nice to have "defaults" under the provider and then optional overrides in the template configs.