Filebeat might erroneously hold on to the file descriptor?

Hello, Filebeat experts,
we just encountered a disk usage issue a few weeks after deploying Filebeat to production.
We set up Filebeat to ship log files (files with the .log extension) to Kafka. The log files are rotated every day and renamed with a timestamp appended (e.g., XXX.log.datestamp). Log files more than 5 days old are deleted.
It seems that, on a couple of machines, even though the old log files had been deleted by cron jobs, they were still held open by Filebeat. As a result the disk space was not freed, and the disks came close to full. After we stopped the Filebeat service on these machines, about 50% of the disk space was freed.
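This symptom can be reproduced outside Filebeat: on Linux, unlinking a file does not free its blocks while any process still holds an open descriptor to it. A minimal shell sketch (the temp file is a throwaway placeholder, not from the post):

```shell
# Hold a file open, delete it, and observe that the kernel keeps it alive.
tmp=$(mktemp)
echo "some log line" > "$tmp"
exec 3< "$tmp"        # keep an fd open, as a harvester would
rm "$tmp"             # the cron job's "delete"
ls -l /proc/$$/fd/3   # the link target is still shown, with a "(deleted)" suffix
exec 3<&-             # closing the fd is what actually frees the space
```

On a machine showing the problem, `lsof +L1` (files whose link count is zero, i.e. deleted but still open) is a quick way to see which removed files Filebeat is still holding.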

I have two questions:
(1) when would filebeat release a file descriptor?
The desirable behavior for us would be: for files that are renamed or deleted, Filebeat releases the fd once it finishes processing the content that existed before the rename/delete.

By the way, I listed all Filebeat log lines related to the fd being held after deletion. What looks strange to me is that the "Harvester started for file" and "Closing" lines do not match up: there are 11 "Harvester started" lines and only 5 "Closing" lines. I would expect at most one more "started" than "Closing" — why are there so many more "Harvester started" lines?

grep "READER-0.log" filebeat*
filebeat.2:2017-10-13T16:37:25Z INFO Harvester started for file: READER-0.log
filebeat.2:2017-10-14T00:00:10Z INFO Harvester started for file: READER-0.log
filebeat.2:2017-10-15T00:00:05Z INFO Harvester started for file: READER-0.log
filebeat.2:2017-10-15T16:13:03Z INFO File is inactive: READER-0.log. Closing because close_inactive of 5m0s reached.
filebeat.2:2017-10-16T00:00:07Z INFO Harvester started for file: READER-0.log
filebeat.2:2017-10-16T01:37:23Z INFO File is inactive: READER-0.log. Closing because close_inactive of 5m0s reached.
filebeat.2:2017-10-16T18:12:02Z INFO File is inactive: READER-0.log. Closing because close_inactive of 5m0s reached.
filebeat.2:2017-10-17T00:00:00Z INFO Harvester started for file: READER-0.log
filebeat.1:2017-10-18T00:00:03Z INFO Harvester started for file: READER-0.log
filebeat.1:2017-10-19T00:00:01Z INFO Harvester started for file: READER-0.log
filebeat.1:2017-10-19T23:58:28Z INFO File is inactive: READER-0.log. Closing because close_inactive of 5m0s reached.
filebeat.1:2017-10-20T00:00:07Z INFO Harvester started for file: READER-0.log
filebeat.1:2017-10-21T00:00:06Z INFO Harvester started for file: READER-0.log
filebeat:2017-10-21T17:00:42Z INFO File is inactive: READER-0.log. Closing because close_inactive of 5m0s reached.
filebeat:2017-10-22T00:00:09Z INFO Harvester started for file: READER-0.log
filebeat:2017-10-23T00:00:02Z INFO Harvester started for file: READER-0.log

I also logged into the machines that do not have the disk-full issue; there the number of "Harvester started for file" lines matches the number of "Closing" lines more closely, e.g., 11 "Harvester started" lines and 9 "Closing" lines.

(2) Under what scenarios might Filebeat erroneously hold on to file descriptors?
thanks,
yan

Have you had a look at the different close_* settings? Note that Filebeat can only close files if it is not blocked by an unresponsive output.

@steffens, could you elaborate on "filebeat can only close files if it is not blocked by an unresponsive output"? Can you give an example of an "unresponsive output"?

Re. the close settings: by default close_inactive is 5m, which matches the logging ("Closing because close_inactive of 5m0s reached") I included in my original question. If close_inactive is implemented correctly, we should not be holding open file descriptors for too long. Maybe this is related to the "unresponsive output" you mentioned?

As for the other settings, we can set close_removed to true; we cannot set close_renamed to true, though. Even after a file is renamed, we still want Filebeat to finish processing it.
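Assuming a 5.x-style filebeat.yml (the paths below are placeholders, not from the original post), the settings discussed above might be combined like this sketch; close_timeout is an additional setting that puts a hard upper bound on a harvester's lifetime:

```yaml
# Hypothetical filebeat.yml fragment (Filebeat 5.x prospector syntax)
filebeat.prospectors:
- input_type: log
  paths:
    - /path/to/logs/*.log
  close_inactive: 5m      # close after 5 minutes without new data (the default)
  close_removed: true     # release the fd once the file is deleted
  close_renamed: false    # keep reading renamed files until fully processed
  close_timeout: 1h       # hard upper bound on how long a harvester stays open
```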

At this location filebeat publishes a new event. If the spooler has become full while waiting for events to be ACKed, the send will block on a Go channel.

Only after the current event has been published does the harvester check whether the file needs to be closed. If the harvester is just waiting for new lines (reading the file), it can be interrupted easily.

It's a known limitation in 5.x. In 6.0 we try to propagate the harvester shutdown signal to the channel, so that the harvester becomes uncoupled from the output.

Still, some close_* settings, like close_inactive, require 'synchronous' processing. That is, we have to attempt a read in order to check how long the file has been open. If Filebeat is blocked by full queues, it cannot read. The close_timeout setting tries to close a harvester asynchronously.

If the file is closed early but still matches the prospector path settings, it will be picked up again. E.g., renaming a file does not change its inode, and the inode uniquely identifies a file. That is, close_renamed might close the file now, but if there is more unprocessed content, Filebeat will pick the file up again and continue processing from the last known offset.
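The inode point is easy to verify in a shell (GNU stat; the temp filename is illustrative):

```shell
# A rename keeps the inode, so Filebeat's registry still recognizes the file.
tmp=$(mktemp)
stat -c %i "$tmp"            # inode number before the rename
mv "$tmp" "$tmp.rotated"
stat -c %i "$tmp.rotated"    # same inode number after the rename
rm "$tmp.rotated"
```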
