Hi Team,
We're facing an intermittent issue with our Metricbeat pods (Metricbeat 8.15.1 running on Kubernetes). For most of the past 30 days they have run smoothly, shipping metrics to Elasticsearch and showing up in Kibana’s Stack Monitoring. Occasionally, however, the pods stop sending metrics and the data disappears from Stack Monitoring. Restarting the pods resolves the issue temporarily, but we would like to understand the root cause and find a permanent fix.
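By "restarting the pods" we mean recycling the Metricbeat DaemonSet, roughly as follows (the namespace, DaemonSet name, and label selector below are placeholders, not values taken from our cluster):

# Recycle the Metricbeat pods (placeholder namespace and DaemonSet name)
kubectl -n <metricbeat-namespace> rollout restart daemonset/metricbeat
# Watch the replacement pods come up (label selector is an assumption)
kubectl -n <metricbeat-namespace> get pods -l k8s-app=metricbeat -w

Once the new pods are running, metrics reappear in Stack Monitoring until the next occurrence.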
The error messages are consistent each time this issue occurs. Here’s an excerpt from the logs:
2025-04-02T05:47:50.949714203Z {"log.level":"info","@timestamp":"2025-04-02T05:47:50.949Z","log.logger":"publisher_pipeline_output","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run","file.name":"pipeline/client_worker.go","file.line":139},"message":"Attempting to reconnect to backoff(elasticsearch(http://data-es-http.ns-data-hc.svc.cluster.local:9200)) with 42540 reconnect attempt(s)","service.name":"metricbeat","ecs.version":"1.6.0"}
2025-04-02T05:47:54.039948989Z {"log.level":"info","@timestamp":"2025-04-02T05:47:54.039Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":0}}}},"cpu":{"system":{"ticks":2215620,"time":{"ms":10}},"total":{"ticks":17230240,"time":{"ms":100},"value":17230240},"user":{"ticks":15014620,"time":{"ms":90}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":9},"info":{"ephemeral_id":"1758c9be-cb6b-4d0f-ba79-fc1c7c455548","uptime":{"ms":1987920151},"version":"8.15.1"},"memstats":{"gc_next":111518576,"memory_alloc":55031968,"memory_total":1983322694800,"rss":114221056},"runtime":{"goroutines":84}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0},"write":{"latency":{"histogram":{"count":41350,"max":1951,"mean":243.32421875,"median":233,"min":100,"p75":269.75,"p95":352.25,"p99":631.25,"p999":1935.2750000000142,"stddev":104.49907651064267}}}},"pipeline":{"clients":16,"events":{"active":3217,"retry":1600},"queue":{"filled":{"bytes":4161466,"events":3200,"pct":1},"max_bytes":0,"max_events":3200}}},"system":{"load":{"1":16.78,"15":16.88,"5":17.64,"norm":{"1":0.1311,"15":0.1319,"5":0.1378}}}},"ecs.version":"1.6.0"}}
2025-04-02T05:48:24.040881779Z {"log.level":"info","@timestamp":"2025-04-02T05:48:24.040Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":0}}}},"cpu":{"system":{"ticks":2215630,"time":{"ms":10}},"total":{"ticks":17230260,"time":{"ms":20},"value":17230260},"user":{"ticks":15014630,"time":{"ms":10}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":9},"info":{"ephemeral_id":"1758c9be-cb6b-4d0f-ba79-fc1c7c455548","uptime":{"ms":1987950152},"version":"8.15.1"},"memstats":{"gc_next":111518576,"memory_alloc":55413584,"memory_total":1983323076416,"rss":114221056},"runtime":{"goroutines":84}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0},"write":{"latency":{"histogram":{"count":41350,"max":1951,"mean":243.32421875,"median":233,"min":100,"p75":269.75,"p95":352.25,"p99":631.25,"p999":1935.2750000000142,"stddev":104.49907651064267}}}},"pipeline":{"clients":16,"events":{"active":3217},"queue":{"filled":{"bytes":4161466,"events":3200,"pct":1},"max_bytes":0,"max_events":3200}}},"system":{"load":{"1":15.29,"15":16.75,"5":17.18,"norm":{"1":0.1195,"15":0.1309,"5":0.1342}}}},"ecs.version":"1.6.0"}}
2025-04-02T05:48:46.817504836Z {"log.level":"error","@timestamp":"2025-04-02T05:48:46.817Z","log.logger":"publisher_pipeline_output","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run","file.name":"pipeline/client_worker.go","file.line":148},"message":"Failed to connect to backoff(elasticsearch(http://data-es-http.ns-data-hc.svc.cluster.local:9200)): Get \"http://data-es-http.ns-data-hc.svc.cluster.local:9200\": context canceled","service.name":"metricbeat","ecs.version":"1.6.0"}
From the logs, it appears that Metricbeat cannot maintain a stable connection to Elasticsearch and therefore stops shipping metrics. Some key observations:
- The error suggests a connectivity or network issue between Metricbeat and Elasticsearch: the GET to http://data-es-http.ns-data-hc.svc.cluster.local:9200 fails with "context canceled", and the reconnect counter has reached 42,540 attempts. We plan to verify connectivity directly the next time this happens (see the check sketched after this list).
- While the output is down, the internal queue stays full ("events":3200 of "max_events":3200) and events are retried, which matches the metrics disappearing from Stack Monitoring.
- It happens intermittently but always results in the same errors.
- Restarting the pods resolves the issue temporarily.
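To narrow this down the next time it happens, we intend to test reachability from an affected Metricbeat pod to the Elasticsearch service named in the logs. A rough sketch of the check (the pod name is a placeholder, and this assumes curl and getent are available in the Metricbeat image):

# Plain HTTP check against the Elasticsearch service from inside an affected pod
# (even an authentication error in the response would confirm the service is reachable)
kubectl -n <metricbeat-namespace> exec metricbeat-xxxxx -- \
  curl -sS -m 10 http://data-es-http.ns-data-hc.svc.cluster.local:9200
# DNS resolution check for the same service name
kubectl -n <metricbeat-namespace> exec metricbeat-xxxxx -- \
  getent hosts data-es-http.ns-data-hc.svc.cluster.local

If these succeed while Metricbeat is still logging "context canceled", that would suggest the problem is not plain network reachability, so any pointers on where to look next would be much appreciated.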