On the Stack Monitoring page in Kibana, in the Beats section, there is a column called Output Errors. What is this about? I can't see any errors in the logs of the remote Beats that are being monitored.
If it is about events/logs not being written because of an error when connecting to Elasticsearch, does this mean that the logs get lost? Or will sending be retried until it succeeds?
Thanks @stephenb. To add to the available info: when you click a specific Beat agent in the Kibana interface, there are some info balloons that say the following.
For the "Fail rate" metric:
Interval: 10 seconds.
Failed in Pipeline: Failures that happened before event was added to the publishing pipeline (output was disabled or publisher client closed).
Dropped in Pipeline: Events that have been dropped after N retries (N = max_retries setting).
Dropped in Output: (Fatal drop) Events dropped by the output as being "invalid." The output still acknowledges the event for the Beat to remove it from the queue.
Retry in Pipeline: Events in the pipeline that are trying again to be sent to the output.
And in the "Output errors" metric:
Interval: 10 seconds.
Sending: Errors in writing the response from the output.
Receiving: Errors in reading the response from the output.
From the above, I feel that I am safe if the Fail Rate is zero. I don't understand what the output error is though and how it could potentially relate to these fail rates. Looking forward to a reply from @jsoriano.
I agree that this is not very clear in the docs. Even after looking at the code I am not 100% sure of the meaning, but let me try to explain.
beat.stats.libbeat.output.write.errors counts low-level errors in the underlying HTTP request or TCP connection. These errors are probably harmless if there are no failed or dropped events, but a high number may indicate some kind of issue in the network or in the output cluster, or congestion somewhere along the way.
beat.stats.libbeat.output.events.failed indicates a higher-level failure at the output level. It means that the output hasn't been able to confirm that an event was written, and the event will probably be retried. So in general these are transient failures that shouldn't lead to data loss.
beat.stats.libbeat.output.events.dropped counts dropped events, which are lost for sure. This usually indicates that the Beat is sending events that cannot be indexed, which in turn usually points to a bug in Beats, or to some kind of misconfiguration or unusual setup. Logs may help to identify the culprit when this happens.
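To make the reading of the three counters above concrete, here is a minimal Python sketch that turns them into a rough verdict. It is only an illustration of the interpretation given in this thread, not anything Beats or Kibana ships. The nested document shape, the helper names (get_path, assess_output_health), the verdict labels and the WRITE_ERROR_WARN_THRESHOLD value are all assumptions, and the numbers are assumed to be increases over one sampling interval rather than totals since the Beat started.

```python
from typing import Any, Mapping

# Hypothetical threshold: how many low-level write errors per interval we
# tolerate before suggesting a look at the network or the output cluster.
WRITE_ERROR_WARN_THRESHOLD = 10


def get_path(doc: Mapping[str, Any], dotted: str, default: int = 0) -> int:
    """Walk a nested dict using a dotted field name; return default if absent."""
    node: Any = doc
    for key in dotted.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return default
        node = node[key]
    return node if isinstance(node, int) else default


def assess_output_health(doc: Mapping[str, Any]) -> str:
    """Classify one interval of Beat output stats (assumed to be deltas)."""
    write_errors = get_path(doc, "beat.stats.libbeat.output.write.errors")
    failed = get_path(doc, "beat.stats.libbeat.output.events.failed")
    dropped = get_path(doc, "beat.stats.libbeat.output.events.dropped")

    if dropped > 0:
        # Dropped events are lost for sure; check the Beat's logs.
        return "data-loss"
    if failed > 0:
        # Unconfirmed events; they will probably be retried.
        return "retrying"
    if write_errors >= WRITE_ERROR_WARN_THRESHOLD:
        # No events affected, but many connection-level errors.
        return "check-network"
    return "ok"


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {
        "beat": {"stats": {"libbeat": {"output": {
            "write": {"errors": 3},
            "events": {"failed": 0, "dropped": 0},
        }}}}
    }
    print(assess_output_health(sample))
```

The checks run from most to least severe, so a single dropped event outweighs any number of write errors. That matches the point above that write errors alone are probably harmless when nothing has failed or been dropped.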