Recovery mechanism in filebeat

kasi · April 15, 2016, 9:20pm

I am seeing a disturbing behavior with filebeat and hope it is not true:

When the output destination is unreachable or down, filebeat retries for some time, then gives up and, if debug is enabled, flushes the events to the log file. My question: once filebeat is able to re-establish the connection, will it resend the events it could not deliver to logstash?

Also, how does filebeat mark a file as read: when it reads the file, or when it gets an ACK back from the destination?

Thanks,

Kasi

steffens · April 18, 2016, 11:41am

filebeat has send-at-least-once semantics. That is, on failure it will retry with backoff (increasing wait times between attempts).
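As a rough illustration of retrying with increasing wait times, here is a minimal sketch. It is not Filebeat's actual code: `send_with_backoff`, its parameters, and the doubling schedule are all assumptions chosen for the example.

```python
import time


def send_with_backoff(send, batch, initial=1.0, maximum=60.0, sleep=time.sleep):
    """Call send(batch) until it returns True (an ACK).

    Hypothetical helper: waits `initial` seconds after the first failure and
    doubles the wait after each further failure, capped at `maximum`.
    Returns the list of waits that were performed.
    """
    wait = initial
    waits = []
    while not send(batch):
        sleep(wait)
        waits.append(wait)
        wait = min(wait * 2, maximum)
    return waits
```

Because the loop only exits on an ACK, the batch is never dropped; the cost is that a batch may be delivered more than once if an ACK goes missing.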

filebeat uses the registry file (see config options) to remember the last state that the outputs have ACKed. That is, the offset is updated only after an ACK has been received from logstash/elasticsearch.
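The "commit the offset only after an ACK" idea can be sketched as follows. This is a simplified model under stated assumptions: the registry is a plain dict, the file is an in-memory byte string, and `ship` and `publish` are hypothetical names, not Filebeat APIs.

```python
def ship(data: bytes, registry: dict, key: str, publish) -> None:
    """Publish lines starting at the committed offset.

    The offset in `registry` advances only when publish(line) returns True
    (an ACK). On a missing ACK we stop, so the same line is sent again on
    the next run.
    """
    offset = registry.get(key, 0)
    for line in data[offset:].splitlines(keepends=True):
        if not publish(line):
            break  # no ACK: keep the old offset, the line will be resent
        offset += len(line)
        registry[key] = offset
```

If the process restarts between a successful send and the commit, the line is sent twice, which is exactly the at-least-once behaviour described above.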

kasi · April 18, 2016, 3:35pm

Thanks a lot

gringo · April 21, 2016, 12:38am

What would happen if the ACK to filebeat was lost? Will filebeat resend the events to logstash? If so, will logstash reprocess the same events and cause duplication?

steffens · April 21, 2016, 1:17am

Yes. Without an ACK, filebeat cannot tell whether logstash has received and processed the events, so it has to send them again. logstash does no deduplication. This must be solved either at the protocol level (which is still tricky in the presence of load balancing) or via event deduplication (e.g. by having logstash generate an event ID).

I described one potential way to implement deduplication via logstash+elasticsearch here: Detect filebeat retries to remove duplicates in the server side
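To make the event-ID idea concrete, here is a generic sketch (not necessarily the approach from the linked thread). It assumes each event carries its source path and file offset, derives a deterministic ID from them, and writes into a store keyed by that ID, so a resend overwrites instead of duplicating. The names `event_id` and `index` and the dict-based store are hypothetical.

```python
import hashlib


def event_id(source: str, offset: int, message: str) -> str:
    """Deterministic ID: the same event always hashes to the same value."""
    raw = f"{source}:{offset}:{message}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


def index(store: dict, event: dict) -> bool:
    """Write the event under its ID; return True if it was not seen before."""
    eid = event_id(event["source"], event["offset"], event["message"])
    is_new = eid not in store
    store[eid] = event  # a resend overwrites the identical entry
    return is_new
```

The same effect is what you would aim for when using a fingerprint as the document ID in the final data store: writes become idempotent, so retries are harmless.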

The same 'problem' exists with other protocols/outputs (kafka, redis) as well, as they have no support for dealing with old resends. On the other hand, the advantage of send-at-least-once semantics is reduced bookkeeping, especially in the presence of load balancing.

In most cases (as long as the network is not unstable for too long), this is good enough. But be advised to monitor your systems and, if things get wonky, stop data ingestion (e.g. turn kafka off). For example, kafka is notorious for storing everything based on retention times without taking disk space usage into account (to be fair, this behaviour is configurable) until the system breaks in bad ways. This is a general problem (a design decision) in some systems. So you either drop events (not possible with filebeat) or stop data ingestion when systems become unresponsive and overloaded (using X times more disk/CPU than normal).

Solving deduplication at the protocol level would require, e.g., sequence numbers to detect resends on the server side (as TCP does to detect duplicate segments), plus consensus among servers in the presence of load balancing, in order to detect resends that were forwarded to another node (failover handling by the client). To deal with client/server restarts, you might also want to log sequence numbers to disk. And there goes scalability.
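A toy model of that trade-off, purely for illustration: each server remembers the highest sequence number seen per client and ignores resends. The `Server` class and its `shared` argument are hypothetical, and a shared dict stands in for the consensus that real servers would need.

```python
class Server:
    """Accepts events tagged with a per-client sequence number."""

    def __init__(self, shared=None):
        # `shared` models state replicated across servers; without it each
        # server only knows about the sequence numbers it saw itself.
        self.last_seq = shared if shared is not None else {}
        self.accepted = []

    def receive(self, client: str, seq: int, event: str) -> bool:
        if seq <= self.last_seq.get(client, 0):
            return True  # resend detected: ACK again, but do not store
        self.last_seq[client] = seq
        self.accepted.append(event)
        return True
```

A single server filters the resend. Two independent servers behind a load balancer each accept the same event once, so the duplicate gets through unless the sequence state is shared, and keeping that state consistent and durable is the scalability cost mentioned above.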

helloworld · February 13, 2017, 4:18am

I am a little confused here. Going through the documentation, the definition of backoff says "The backoff option defines how long Filebeat waits before checking a file again after EOF is reached." There is no mention of acknowledgements. Could you please clarify?

steffens · February 13, 2017, 2:20pm

@helloworld sounds like you are confusing the input with the output. The input side tails the file until EOF. Because new entries may still be appended to log files that filebeat has already fully processed, filebeat keeps the file open and retries reading from the same file (after EOF). To avoid a busy loop using 100% CPU once the end of a file has been reached, the input backs off a little (sleeps) while waiting for new entries to be written to the log file.
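A minimal sketch of that input-side loop, for illustration only. The `tail` function is hypothetical; its parameters are loosely modelled on the input's backoff options, and `max_idle_polls` exists only so the example can terminate.

```python
import io
import time


def tail(stream, backoff=1.0, max_backoff=10.0, factor=2,
         max_idle_polls=None, sleep=time.sleep):
    """Yield lines from `stream`; at EOF, sleep instead of spinning.

    The wait starts at `backoff`, grows by `factor` after every empty read
    up to `max_backoff`, and resets as soon as a new line appears.
    """
    wait = backoff
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        line = stream.readline()
        if line:
            wait = backoff  # new data: reset the backoff
            idle = 0
            yield line
        else:
            sleep(wait)  # EOF: back off rather than busy-loop
            idle += 1
            wait = min(wait * factor, max_backoff)
```

Note that this backoff concerns reading the file only; it is independent of the retry behaviour on the output side.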

helloworld · March 8, 2017, 5:50am

Sorry for the delayed response. Thanks for clarifying this part. Quoting from the documentation:

"In situations where the defined output is blocked and has not confirmed all events, Filebeat will keep trying to send events until the output acknowledges that it has received the events".

As you mentioned, Filebeat keeps retrying with increasing backoff when the destination fails. I am curious whether filebeat provides any setting to control the backoff factor on the output side.

Also, is there any timeout after which filebeat stops retrying following repeated failures? Assuming there is a timeout, will filebeat still crawl the log files for new updates? I am particularly interested in how filebeat behaves when the destination is unreachable and the log file keeps growing by a few hundred lines every second.

ruflin · March 11, 2017, 4:22pm

If you use the Logstash output, the backoff at the output level happens automatically to make sure LS is not overloaded.

Filebeat will keep the files open until it has successfully sent the events. In 5.3 you will have the option to use close_timeout to force the closing of files. But if files are deleted from the file system before the output becomes available again, this can mean you lose data.

billzy · May 17, 2017, 1:21pm

So if the LS destination cannot be reached, filebeat keeps the current offset recorded in the registry. Suppose that for a certain amount of time, say 5 minutes, many new log lines are generated and the file grows far beyond that offset. Once LS is back online, filebeat will reconnect and send all logs from the previously recorded offset up to EOF as fast as possible. Is that how filebeat behaves? If so, this will use a lot of CPU in my case.

steffens · May 17, 2017, 11:34pm

Yes, that's the behavior. You can limit CPU usage using system utilities (e.g. nice, taskset, cgroups / systemd / docker resource limits). You can also configure the number of active OS threads in the Go runtime via max_procs in your config file (this sets GOMAXPROCS). Setting it to 1 limits the runtime to 1 active OS thread.
