I am experiencing an issue with Filebeat in our Kubernetes cluster and would like to know if others have encountered a similar situation or have any insights into this behavior.
Issue: When network traffic from Filebeat to Logstash is blocked, the output metrics spike as expected. For example, the filebeat_libbeat_output{events="failed"}
metric shows a spike, indicating failed events. However, after roughly 10 minutes the metrics return to normal levels, even though the network block is still in place and the underlying issue remains unresolved.
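For context, this is roughly how I watch the behavior on a single pod (a minimal sketch, assuming Filebeat's HTTP monitoring endpoint is enabled with http.enabled: true on the default localhost:5066, and that the /stats JSON exposes the counter under libbeat.output.events.failed as I understand it):

```python
import json
import time
import urllib.request

STATS_URL = "http://localhost:5066/stats"  # assumes http.enabled: true in filebeat.yml
POLL_INTERVAL = 30  # seconds between samples

def failed_events() -> int:
    """Read the cumulative libbeat.output.events.failed counter from Filebeat."""
    with urllib.request.urlopen(STATS_URL, timeout=5) as resp:
        stats = json.load(resp)
    return stats["libbeat"]["output"]["events"]["failed"]

previous = failed_events()
while True:
    time.sleep(POLL_INTERVAL)
    current = failed_events()
    print(f"failed events in the last {POLL_INTERVAL}s: {current - previous}")
    previous = current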
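```

The per-interval delta spikes in the first few minutes of the block and then tails off toward zero, which matches what I see in the scraped metric.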
Expectation: I would expect the metrics to continue indicating an issue (i.e., remain at elevated levels) until the network block is removed and Filebeat can successfully send data to Logstash again.
Questions:
- Is this behavior expected from Filebeat's retry and backoff mechanism?
- Have others experienced similar issues with Filebeat metrics in a Kubernetes environment?
- If this behavior is expected, how do you recommend setting up alerts on these metrics so that an ongoing outage like this stays visible? I've included a sketch of what I'm considering below.
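What I'm considering so far, in case it helps frame the question, is alerting on the absence of acknowledged events rather than on the presence of failures, since the failure rate itself seems to die down as the backoff grows. Again, this is only a sketch against the same /stats endpoint; the field names, window, and thresholds are my own assumptions:

```python
import json
import time
import urllib.request

STATS_URL = "http://localhost:5066/stats"  # assumes http.enabled: true in filebeat.yml
WINDOW = 300  # seconds between the two samples the check compares

def output_events() -> dict:
    """Return the cumulative libbeat.output.events counters from Filebeat."""
    with urllib.request.urlopen(STATS_URL, timeout=5) as resp:
        return json.load(resp)["libbeat"]["output"]["events"]

def output_looks_blocked() -> bool:
    """Alert condition: events are being attempted or failing, but nothing is acked."""
    before = output_events()
    time.sleep(WINDOW)
    after = output_events()
    acked = after["acked"] - before["acked"]
    attempted = after["total"] - before["total"]
    failed = after["failed"] - before["failed"]
    # Fire when Filebeat tried (or failed) to send events but none were acknowledged.
    return acked == 0 and (attempted > 0 or failed > 0)

if __name__ == "__main__":
    if output_looks_blocked():
        print("ALERT: Filebeat output looks blocked (no acked events in the last window)")
```

The idea is that the acked counter stays flat for as long as the block is in place, even after the failed rate tails off, but I'm not sure whether that's the right signal to rely on.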
Any guidance or suggestions would be greatly appreciated.
Thank you!