I have a use case in which multiple large files are pushed to a server. These are read by Filebeat and sent to Logstash.
I have written a script that reads the registry file and gets the names of the files that have been acknowledged by Filebeat.
My question is whether it is safe to delete these files. When is a file actually updated in the registry? Is the entry updated after the file has been completely processed by Filebeat, or when Filebeat starts to process it?
Filebeat saves a state for a file when it first encounters it. At that point the offset is 0, because no messages have been acknowledged yet. After the output returns the ACK, the size of the ACKed messages in bytes is saved as the offset of the file. If the file is completely processed, meaning the last offset in the registry points to the end of the file and no more lines are coming in, it should be safe to delete the file.
Thank you for the clarification. Here is a Python script for deleting files based on the registry file information:
#!/usr/bin/env python
import json
import os


def watcher(filepath="/var/lib/filebeat/registry"):
    """Delete the files that the registry file reports as fully read."""
    # The registry is expected to hold a JSON list of per-file states.
    with open(filepath, "r") as f:
        states = json.load(f)
    for state in states:
        try:
            # A file is fully read when the ACKed offset equals its size.
            if int(state["offset"]) == os.stat(state["source"]).st_size:
                os.remove(state["source"])
            else:
                print("File not fully processed:", state["source"])
        except Exception as e:
            print(e)


if __name__ == "__main__":
    watcher()
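The script compares only the offset with the file size. The reply above also requires that no more lines are coming in, which the registry alone cannot tell you. A rough sketch of one way to approximate that is to also require that the file has not been modified for a while. The helper name `is_safe_to_delete` and the `quiet_seconds` threshold are assumptions made for illustration, not part of Filebeat; the entry is assumed to be a dict with the same "source" and "offset" keys the script uses.

```python
import os
import time


def is_safe_to_delete(entry, quiet_seconds=300, now=None):
    """Return True if a registry entry looks fully read and idle.

    entry: dict with "source" (path) and "offset" (ACKed bytes),
           as assumed by the script in this thread.
    quiet_seconds: hypothetical idle threshold; tune it to how
           often the writer appends to the file.
    now: optional timestamp, mainly useful for testing.
    """
    try:
        info = os.stat(entry["source"])
    except OSError:
        # File is missing or unreadable, so there is nothing to delete.
        return False
    if now is None:
        now = time.time()
    # Condition 1: the ACKed offset points to the end of the file.
    fully_read = int(entry["offset"]) == info.st_size
    # Condition 2: nothing has been written to the file recently.
    idle = (now - info.st_mtime) >= quiet_seconds
    return fully_read and idle
```

In the loop, `os.remove(state["source"])` would then be called only when `is_safe_to_delete(state)` returns True. A modification-time check is only a heuristic: a writer that pauses for longer than the threshold would still be treated as finished.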