Recovery mechanism in filebeat


(kasi) #1

I am seeing a disturbing behavior with filebeat and hope it is not true:

When the output destination is unreachable or down, filebeat retries for some time and then gives up, flushing the events to the log file if debug is enabled. My question: will filebeat resend the events it could not deliver to logstash once it is able to re-establish the connection?

Also, how does filebeat mark a file as read: when it reads the file, or when it gets an ACK back from the destination?

Thanks,

Kasi


(Steffen Siering) #2

filebeat has send-at-least-once semantics. That is, on failure it retries with backoff (increasing wait times between attempts).

filebeat uses the registry file (see config options) to remember the last state that the outputs have ACKed. That is, the offset is updated only after an ACK has been received from logstash/elasticsearch.
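To make the retry and ACK behaviour concrete, here is a minimal Python sketch. It is not Filebeat's actual code: the names `ship`, `backoff_delays`, and the dict used as a "registry" are all hypothetical, and the batch size and delay values are arbitrary. It only shows the idea that the stored offset moves forward after an ACK and that a failed batch is sent again after a growing wait.

```python
def backoff_delays(initial=1.0, maximum=60.0, factor=2.0):
    """Yield increasing wait times, capped at `maximum` (values are illustrative)."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def ship(lines, registry, send, sleep, batch_size=2):
    """Send lines starting at the registered offset; advance it only on ACK."""
    delays = backoff_delays()
    while registry["offset"] < len(lines):
        start = registry["offset"]
        batch = lines[start:start + batch_size]
        if send(batch):                       # True stands for "output ACKed the batch"
            registry["offset"] = start + len(batch)
            delays = backoff_delays()         # reset the backoff after a success
        else:
            sleep(next(delays))               # wait, then resend the same batch


# Usage: the output fails twice, then recovers.
received, waits = [], []
outcomes = iter([False, False, True, True])

def flaky_send(batch):
    ok = next(outcomes)
    if ok:
        received.extend(batch)
    return ok

registry = {"offset": 0}
ship(["a", "b", "c"], registry, flaky_send, waits.append)
```

Because the offset is only written after the ACK, a crash or restart between send and ACK means the same batch goes out again, which is where the at-least-once behaviour comes from.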


(kasi) #3

Thanks a lot


#4

What would happen if the ACK to filebeat was lost? Will filebeat resend the events to logstash? If so, will logstash reprocess the same events and cause duplication?


(Steffen Siering) #5

Yes. Without an ACK filebeat cannot tell whether logstash has received and processed the events, so it has to send them again. logstash does no deduplication. This must be solved either at the protocol level (which is still tricky in the presence of load balancing) or via event deduplication (e.g. by having logstash generate an event id).

One potential way to implement deduplication via logstash+elasticsearch is described in a thread I mentioned here: Detect filebeat retries to remove duplicates in the server side
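As a rough illustration of deduplication by event id (not necessarily the approach from the linked thread), the sketch below derives a deterministic id from fields that stay the same across resends. The choice of source path, byte offset, and message as inputs is an assumption, and the plain dict only stands in for an index where writing a document under an existing id replaces it instead of adding a second copy.

```python
import hashlib


def event_id(source, offset, message):
    """Deterministic id: the same event always hashes to the same value."""
    payload = f"{source}\x00{offset}\x00{message}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def index_event(store, event):
    """Write the event under its id; a resend overwrites instead of duplicating."""
    doc_id = event_id(event["source"], event["offset"], event["message"])
    store[doc_id] = event
    return doc_id


# Usage: the second call simulates a resend after a lost ACK.
store = {}
event = {"source": "/var/log/app.log", "offset": 1024, "message": "user logged in"}
first = index_event(store, event)
second = index_event(store, dict(event))
```

The id has to be computed from the event itself. A random id generated on arrival would differ for each resend and would not remove anything.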

The same 'problem' exists with other protocols/outputs (kafka, redis) as well, as there is no support for dealing with old resends. On the other hand, an advantage of send-at-least-once semantics is reduced bookkeeping, especially in the presence of load balancing.

In most cases (as long as the network is not unstable for too long), that is good enough. But be advised to monitor your systems, and if things get wonky, stop data ingestion (e.g. turn kafka off). For example, kafka is notorious for storing everything based on retention times without taking disk space usage into account (well, to be fair, this behaviour is configurable) until the system breaks in bad ways. This is a general problem (a design decision) in some systems. So you either drop events (not possible with filebeat) or stop data ingestion when systems become unresponsive and overloaded (taking X times more disk/CPU than normal).

Solving deduplication at the protocol level would require e.g. sequence numbers to detect resends on the server side (as TCP does to detect duplicate segments), plus consensus among servers under load balancing in order to detect resends that were forwarded to another node (failover handling by the client). To deal with client/server restarts you might additionally want to log sequence numbers to disk. And there goes scalability.
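The sequence-number idea, and why load balancing complicates it, can be sketched in a few lines of Python. The `Receiver` class is hypothetical and assumes each client sends its events in order with increasing sequence numbers; nothing like this exists in the protocols discussed here.

```python
class Receiver:
    """Toy server that drops resends by remembering the last sequence number per client."""

    def __init__(self):
        self.last_seq = {}   # client id -> highest sequence number accepted so far
        self.events = []

    def receive(self, client_id, seq, event):
        """Store an event once; return True (the ACK) for new events and resends alike."""
        if seq <= self.last_seq.get(client_id, -1):
            return True      # already processed: ACK again, but do not store
        self.last_seq[client_id] = seq
        self.events.append(event)
        return True


# One server: the resend is detected.
single = Receiver()
single.receive("fb-1", 0, "line A")
single.receive("fb-1", 0, "line A")

# Two load-balanced servers without shared state: the resend lands on the
# other node after failover and is stored a second time.
node_a, node_b = Receiver(), Receiver()
node_a.receive("fb-1", 0, "line A")
node_b.receive("fb-1", 0, "line A")
```

Making `node_a` and `node_b` agree on `last_seq`, and keeping that state across restarts, is the consensus and disk logging cost described above.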


(Abhinaythurlapati) #6

I am a little confused here. Going through the documentation, the definition of backoff says "The backoff option defines how long Filebeat waits before checking a file again after EOF is reached." There is no mention of acknowledgements. Could you please clarify this?


(Steffen Siering) #7

@helloworld sounds like you are confusing the input with the output. The input side tails the file until EOF. Since new log entries may still be appended to a file filebeat has already fully processed, filebeat keeps the file open and retries reading from the same file (after EOF). To avoid a busy loop using 100% CPU once the end of a file has been reached, the input backs off a little (sleeps) while waiting for new entries to be written to the log file.
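The input-side backoff can be pictured with the following Python sketch. It is only an approximation of the idea, not Filebeat's implementation: the function `tail`, its parameters, and the delay values are illustrative, partial lines are not handled, and `max_idle_polls` exists only so the example terminates (a real tailer would keep polling).

```python
import io


def tail(stream, handle, sleep, max_idle_polls=3,
         backoff=1.0, max_backoff=10.0, factor=2.0):
    """Read lines until EOF, then poll again with a growing sleep between polls."""
    wait = backoff
    idle = 0
    while idle < max_idle_polls:
        line = stream.readline()
        if line:
            handle(line.rstrip("\n"))
            wait, idle = backoff, 0          # new data: reset the backoff
        else:
            sleep(wait)                      # EOF: sleep instead of spinning
            wait = min(wait * factor, max_backoff)
            idle += 1


# Usage: two lines are read, then three polls at EOF with growing waits.
seen, waits = [], []
tail(io.StringIO("one\ntwo\n"), seen.append, waits.append)
```

This sleep is about reading the file and is unrelated to the retry backoff that the output applies when the destination does not ACK.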


(Abhinaythurlapati) #8

Sorry for the delayed response. Thanks for clarifying this part. Quoting from the documentation:

"In situations where the defined output is blocked and has not confirmed all events, Filebeat will keep trying to send events until the output acknowledges that it has received the events".

As you mentioned, Filebeat keeps retrying with increasing backoff when the destination fails. I am curious to know: does filebeat provide any setting to control the backoff factor on the output side?

Also, is there any timeout after which filebeat stops retrying? Assuming there is one, will filebeat still crawl the log files for new updates? I am particularly interested in how filebeat behaves when the destination is unreachable and the log file keeps growing by a few hundred lines every second.


(ruflin) #9

If you use Logstash output, the backoff on the output level happens automatically to make sure LS is not overloaded.

Filebeat will keep the files open until it has successfully sent the events. In 5.3 you will have the option to use close_timeout to force the closing of files. But if files are deleted from the file system before the output becomes available again, this can mean you lose data.


(ZHIYU YU) #10

So if the LS destination cannot be reached, filebeat keeps the current offset recorded in the registry. Suppose LS stays down for a while, say 5 minutes, during which many new log lines are written and the file grows well past that offset. Once LS is back online, filebeat reconnects and sends everything from the previously recorded offset up to EOF as fast as possible. Is this how filebeat behaves? If so, this will use a lot of CPU in my case.


(Steffen Siering) #11

Yes, that's the behavior. You can limit CPU usage using system utilities (e.g. nice, taskset, cgroups / systemd / docker resource limits). You can also configure the number of active OS threads in the Go runtime via max_procs in your config file (it sets GOMAXPROCS). Setting it to 1 limits the runtime to 1 active OS thread.

