Hi Team,
We're facing an intermittent issue with our Metricbeat pods (Metricbeat 8.15.1 running on Kubernetes). For most of the past 30 days they have run smoothly, shipping metrics to Elasticsearch and showing up in Kibana’s Stack Monitoring. Occasionally, however, the pods stop sending metrics and the data disappears from Stack Monitoring. Restarting the pods resolves the issue temporarily, but we would like to understand the root cause and find a permanent fix.
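By "restarting the pods" we mean recycling the Metricbeat DaemonSet, roughly as follows (the namespace, DaemonSet name, and label selector below are placeholders, not values taken from our cluster):

# Recycle the Metricbeat pods (placeholder namespace and DaemonSet name)
kubectl -n <metricbeat-namespace> rollout restart daemonset/metricbeat
# Watch the replacement pods come up (label selector is an assumption)
kubectl -n <metricbeat-namespace> get pods -l k8s-app=metricbeat -w

Once the new pods are running, metrics reappear in Stack Monitoring until the next occurrence.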
The error messages are consistent each time this issue occurs. Here’s an excerpt from the logs:
2025-04-02T05:47:50.949714203Z {"log.level":"info","@timestamp":"2025-04-02T05:47:50.949Z","log.logger":"publisher_pipeline_output","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run","file.name":"pipeline/client_worker.go","file.line":139},"message":"Attempting to reconnect to backoff(elasticsearch(http://data-es-http.ns-data-hc.svc.cluster.local:9200)) with 42540 reconnect attempt(s)","service.name":"metricbeat","ecs.version":"1.6.0"}
2025-04-02T05:47:54.039948989Z {"log.level":"info","@timestamp":"2025-04-02T05:47:54.039Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":0}}}},"cpu":{"system":{"ticks":2215620,"time":{"ms":10}},"total":{"ticks":17230240,"time":{"ms":100},"value":17230240},"user":{"ticks":15014620,"time":{"ms":90}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":9},"info":{"ephemeral_id":"1758c9be-cb6b-4d0f-ba79-fc1c7c455548","uptime":{"ms":1987920151},"version":"8.15.1"},"memstats":{"gc_next":111518576,"memory_alloc":55031968,"memory_total":1983322694800,"rss":114221056},"runtime":{"goroutines":84}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0},"write":{"latency":{"histogram":{"count":41350,"max":1951,"mean":243.32421875,"median":233,"min":100,"p75":269.75,"p95":352.25,"p99":631.25,"p999":1935.2750000000142,"stddev":104.49907651064267}}}},"pipeline":{"clients":16,"events":{"active":3217,"retry":1600},"queue":{"filled":{"bytes":4161466,"events":3200,"pct":1},"max_bytes":0,"max_events":3200}}},"system":{"load":{"1":16.78,"15":16.88,"5":17.64,"norm":{"1":0.1311,"15":0.1319,"5":0.1378}}}},"ecs.version":"1.6.0"}}
2025-04-02T05:48:24.040881779Z {"log.level":"info","@timestamp":"2025-04-02T05:48:24.040Z","log.logger":"monitoring","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot","file.name":"log/log.go","file.line":192},"message":"Non-zero metrics in the last 30s","service.name":"metricbeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":0}}}},"cpu":{"system":{"ticks":2215630,"time":{"ms":10}},"total":{"ticks":17230260,"time":{"ms":20},"value":17230260},"user":{"ticks":15014630,"time":{"ms":10}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":9},"info":{"ephemeral_id":"1758c9be-cb6b-4d0f-ba79-fc1c7c455548","uptime":{"ms":1987950152},"version":"8.15.1"},"memstats":{"gc_next":111518576,"memory_alloc":55413584,"memory_total":1983323076416,"rss":114221056},"runtime":{"goroutines":84}},"libbeat":{"config":{"module":{"running":1}},"output":{"events":{"active":0},"write":{"latency":{"histogram":{"count":41350,"max":1951,"mean":243.32421875,"median":233,"min":100,"p75":269.75,"p95":352.25,"p99":631.25,"p999":1935.2750000000142,"stddev":104.49907651064267}}}},"pipeline":{"clients":16,"events":{"active":3217},"queue":{"filled":{"bytes":4161466,"events":3200,"pct":1},"max_bytes":0,"max_events":3200}}},"system":{"load":{"1":15.29,"15":16.75,"5":17.18,"norm":{"1":0.1195,"15":0.1309,"5":0.1342}}}},"ecs.version":"1.6.0"}}
2025-04-02T05:48:46.817504836Z {"log.level":"error","@timestamp":"2025-04-02T05:48:46.817Z","log.logger":"publisher_pipeline_output","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run","file.name":"pipeline/client_worker.go","file.line":148},"message":"Failed to connect to backoff(elasticsearch(http://data-es-http.ns-data-hc.svc.cluster.local:9200)): Get \"http://data-es-http.ns-data-hc.svc.cluster.local:9200\": context canceled","service.name":"metricbeat","ecs.version":"1.6.0"}
From the logs, it appears that Metricbeat cannot maintain a stable connection to Elasticsearch and therefore stops shipping metrics. Some key observations:
- The error suggests a connectivity or network issue between Metricbeat and Elasticsearch: the GET to http://data-es-http.ns-data-hc.svc.cluster.local:9200 fails with "context canceled", and the reconnect counter has reached 42,540 attempts. We plan to verify connectivity directly the next time this happens (see the check sketched after this list).
- While the output is down, the internal queue stays full ("events":3200 of "max_events":3200) and events are retried, which matches the metrics disappearing from Stack Monitoring.
- It happens intermittently but always results in the same errors.
- Restarting the pods resolves the issue temporarily.
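To narrow this down the next time it happens, we intend to test reachability from an affected Metricbeat pod to the Elasticsearch service named in the logs. A rough sketch of the check (the pod name is a placeholder, and this assumes curl and getent are available in the Metricbeat image):

# Plain HTTP check against the Elasticsearch service from inside an affected pod
# (even an authentication error in the response would confirm the service is reachable)
kubectl -n <metricbeat-namespace> exec metricbeat-xxxxx -- \
  curl -sS -m 10 http://data-es-http.ns-data-hc.svc.cluster.local:9200
# DNS resolution check for the same service name
kubectl -n <metricbeat-namespace> exec metricbeat-xxxxx -- \
  getent hosts data-es-http.ns-data-hc.svc.cluster.local

If these succeed while Metricbeat is still logging "context canceled", that would suggest the problem is not plain network reachability, so any pointers on where to look next would be much appreciated.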