I intend to drop .json files into a directory that Filebeat is monitoring (they eventually end up in Elasticsearch).
The goal is to hit the data source for the latest changes every 15 minutes and output to a .json file. The number of changes will probably be pretty small on average. Total records today is <600k.
Once I'm done processing the file I would like to remove it.
Scenario A: Create a new .json file every 15 minutes with the changes. -OR- Scenario B: Create a new .json file each day and append changes to it every 15 minutes.
Q1: Which scenario is the better route to go? Does it matter?
Q2: I am responsible for manually removing the .json file, correct?
If so, I was planning on reading the registry file and removing any files not found therein.
I am thinking that I should set close_eof: true to close the file as soon as it is read.
At this point the file is still being monitored for changes and will continue to do so.
I, however, want to go ahead and remove this file from the registry and take it off the radar. The file will technically be "inactive" as soon as I get done reading it. Q3: Should I set clean_inactive to a "small" value, like 30 seconds?
However, this value should be greater than ignore_older + scan_frequency and ignore_older is supposed to be greater than close_inactive. Q4: Does close_inactive really apply/matter if I am using close_eof? The file will be closed long before close_inactive would hit.
Hope my thoughts and questions haven't convoluted the simple problem of just wanting to remove files after processing.
A1: I prefer scenario B as it follows a more traditional log file approach.
A2: You are responsible for removing the json "log" file (assuming you do not have space to let it grow indefinitely), but you can use logrotate to do this for you. If you configure filebeat correctly, it will keep the file handle open until the logs are shipped, regardless of whether the file has already been rotated.
A3: I wouldn't touch clean_inactive or ignore_older for this case (if you use scenario B from Q1). Since you are only updating the file every 15 minutes, you can set close_inactive to a low value (default is 5 min). That just frees the file handle until the next scan picks up new data, see scan_frequency.
A4: No need to use close_eof in this case (if you use scenario B from Q1).
One thing to consider is if your json object will be represented on a single line (easier to work with and better in this case) or multiple lines.
Have a look at clean_removed. I think that is more what you are looking for. So instead of detecting what is missing in the registry file, filebeat will detect what is missing on disk and remove the state for the file.
Hi Troy,
Thank you for your response. I am leaning towards Scenario B as well.
I am putting a time stamp in the file name that I could easily parse and just remove files after a day or week has gone by.
I was hesitating to remove even a day old file, if for some reason Filebeat wasn't done with it. Hence the reason I looked into the clean settings. I don't want to delete the file until Filebeat said it was done, by removing it from the registry. Unfortunately I haven't gotten it to remove the file from the registry on my local yet.
I agree with Scenario B not needing close_eof and the setting close_inactive to a lesser value sounds good.
I plan to serialize the entities and put one per line in the file. So far that has worked fine on my local. Open to thoughts though.
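For what it's worth, here is a minimal sketch of the "one entity per line" idea for Scenario B, appending each batch of changes to the current day's file. The function name `append_changes` and its arguments are made up for illustration; the file name pattern follows the batch_2016.12.15.json files that show up in the debug output in this thread.

```python
import json
import os
from datetime import datetime


def append_changes(directory, records, now=None):
    """Append each record as one JSON object per line to the day's batch file.

    `append_changes` is a hypothetical helper, not part of Filebeat.
    """
    now = now or datetime.now()
    # One file per day, e.g. batch_2016.12.15.json
    path = os.path.join(directory, now.strftime("batch_%Y.%m.%d.json"))
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            # json.dumps (without indent) never emits a raw newline, so
            # every record stays on exactly one line.
            handle.write(json.dumps(record) + "\n")
    return path
```

Opening the file in append mode means each 15 minute run only adds lines to the end, which is the log-like behaviour Scenario B relies on.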
Hey rufllin,
Thanks for the response.
In my response to Troy, I probably spoke to your thoughts, but I'll repeat them here to be clear, at the risk of redundancy.
I plan to leave clean_removed enabled. Certainly want to clear out the registry after a file has been deleted.
I'm new to Filebeat and I'm hesitant to delete a file from the directory unless I know Filebeat is done with it, aka deleted from the registry.
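Roughly what I have in mind for the cleanup, as a sketch only. It assumes the registry file is a JSON array of entries that each carry a "source" path, which is what I see in my own registry; that layout may differ between Filebeat versions, so treat it as an assumption. The function name `files_safe_to_delete` is just mine.

```python
import json
import os


def files_safe_to_delete(directory, registry_path):
    """Return .json files in `directory` that have no entry in the registry.

    Assumption: the registry is a JSON array of objects with a "source" key.
    """
    with open(registry_path, encoding="utf-8") as handle:
        states = json.load(handle)
    # normcase makes the comparison case-insensitive on Windows.
    tracked = {os.path.normcase(os.path.abspath(s["source"])) for s in states}
    candidates = []
    for name in sorted(os.listdir(directory)):
        full = os.path.abspath(os.path.join(directory, name))
        if name.endswith(".json") and os.path.normcase(full) not in tracked:
            candidates.append(full)
    return candidates
```

One caveat with this approach: a file that Filebeat has not picked up yet is also missing from the registry, so a real script would need an extra check (for example on the date in the file name) before deleting anything.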
Here is my Filebeat config. Files are not being purged from the registry even after plenty of time has passed.
This may be a question for a separate thread, but I feel like this config should work for what I'm trying to do.
DBG Check file for harvesting: C:\Elasticsearch5\elasticsearch\input\batch_2016.12.15.json
DBG Ignore file because ignore_older reached: C:\Elasticsearch5\elasticsearch\input\batch_2016.12.15.json
DBG Do not write state for ignore_older because clean_inactive reached
DBG Check file for harvesting: C:\Elasticsearch5\elasticsearch\input\batch_2016.12.16.json
DBG Ignore file because ignore_older reached: C:\Elasticsearch5\elasticsearch\input\batch_2016.12.16.json
DBG Do not write state for ignore_older because clean_inactive reached
EDIT: I just re-read your posts and want to confirm one thing, are you deleting the older files or are they still in that directory? If you are deleting them my response is useless to you, sorry.
I may be wrong, but it appears filebeat is working just as it should. I am going to be verbose in this explanation, even though I assume you understand most of the config options. The verbosity is only to ensure clarity.
Here is my interpretation of your config:
- input_type: log
  paths:
    # Look in the es/input/ directory for any file that matches *.json
    - C:/Elasticsearch5/elasticsearch/input/*.json
  # if the file has not been added to (thus harvested) after one minute
  # of the last line being added, close the file handle
  close_inactive: 1m
  # if the file has not been modified within a 2 min time frame, then
  # ignore it (ie. do not harvest)
  ignore_older: 2m
  # scan for files (matching the path) every 10 seconds
  scan_frequency: 10s
  # if the file has been inactive for more than 3 min, do not store the
  # state of it in the registry
  clean_inactive: 3m
What this tells me is:
1. Filebeat will scan that directory every 10 seconds for any file matching *.json.
   a. If it finds a file that has been updated in the last 2 minutes, it harvests from the file (thus opening a file handle). The file handle stays open until no new data has been added for 1 minute after the last line was harvested. At that point, Filebeat closes the file (and with it the file handle).
   b. If it finds a file that has not been updated in the last 2 minutes, that file is ignored.
2. If the file has been inactive for longer than 3 min, clean up the state of the file and do not write the state of the file to the registry. Note: The state of a file is identifying information about the file (on Linux it is the inode) plus the offset, which tells the harvester where to start harvesting to ensure all lines are harvested.
In your case, the file is checked for harvesting every 10 seconds (since the path/filename matches), and it is ignored because it has not been modified within the last 2 minutes. Until three minutes have passed without the file changing, the state of the file will be stored in the registry every 10 seconds. Once three minutes have passed without the file changing, the state of the file will be completely wiped from the registry and new states will not be written.
Troy your response is very helpful and confirms my understanding.
First, the plan is to remove the files, but only after they have been removed from the registry.
My last post is my current problem...the files aren't being removed from the registry, even though the "three minutes" has passed. Your post confirmed that they should be.
I'm not even sure I want or need the ignore_older setting, but I can't comment it out since clean_inactive needs it and Filebeat will complain if it's missing. If I extend it, it will conflict with clean_inactive, which is supposed to be longer than ignore_older + scan_frequency.
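To sanity check the timing rules mentioned in this thread (ignore_older greater than close_inactive, and clean_inactive greater than ignore_older + scan_frequency), I put together a small sketch. The helper names `seconds` and `check_timings` are my own, and the duration parser only understands simple values like 10s, 2m or 1h, not everything Filebeat accepts.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def seconds(value):
    """Convert a simple duration such as '10s', '2m' or '1h' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_timings(close_inactive, ignore_older, scan_frequency, clean_inactive):
    """Return a list of rule violations; an empty list means the rules hold."""
    problems = []
    if seconds(ignore_older) <= seconds(close_inactive):
        problems.append("ignore_older must be greater than close_inactive")
    if seconds(clean_inactive) <= seconds(ignore_older) + seconds(scan_frequency):
        problems.append(
            "clean_inactive must be greater than ignore_older + scan_frequency"
        )
    return problems


# Values from the config discussed in this thread: 120 > 60 and 180 > 120 + 10.
print(check_timings("1m", "2m", "10s", "3m"))  # prints []
```

Raising ignore_older to 3m while leaving the rest alone makes the second rule fail (180 is not greater than 180 + 10), which is the conflict described above.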
The debug output from your earlier post does not indicate that the files are not being purged from the registry. It actually does the opposite and suggests that the files are correctly being removed from the registry.
It is important to note that even if a file is not in the registry, on the next scan (based on scan frequency) filebeat will still match that file, take a look at it and decide if anything should be harvested based on your other settings. This is what the debug output shows is happening.
You should locate the registry file (I do not know the path of this on windows) and take a look at the information it is storing. From here, I think you will find the information is being cleaned from the registry.
I've been monitoring it in Notepad++.
If I drop a new file into the directory it will get processed as expected, but the registry entry does not go away after 3 minutes.
If I add a second file into the directory after the 3 minutes, the first entry will be removed and the second will take its place.
If I add the second file before the 3 minutes, both entries will remain. A third file after the 3 minutes will replace both.
What version of filebeat are you on? I have found some previous issues with registry cleaning in a few of the filebeat releases though none match this one identically.
You are correct that it should be cleaning those and I will attempt to reproduce the issue. It may be worth you submitting a bug report if we can demonstrate it.
I was able to reproduce this behavior exactly as you described. It appears that the registry file is only modified if there are file state updates that need to be written to the file. In the cases we describe, there are not any file state updates that need to be written to the registry so the current states in the registry are not overwritten. It is only when there is a file state to update in the registry file that file state entries will be overwritten (cleared).
If the only action to take on the registry file is to remove entries, nothing is done. If an update is made to one file, call it fileA (so a new state should be stored), and a different file reaches "Do not write state for ignore_older because clean_inactive reached", only the state for fileA will be written.
I do not know whether this is expected behavior, though it should be mentioned in the docs so people like us do not go on a wild goose chase.
Let me know if you want to submit a new issue or if you would like me to so they can either fix it or add a note in the documentation.
It is currently the expected behaviour that if no new events are coming in, the registry file is not written, which also means no states are cleaned up. The reason is that cleaning up on a timer would require a scheduler in the registrar, and currently I'm not aware of any benefits of writing the file frequently. The registry will automatically be cleaned up the next time events are sent, or when filebeat is shut down. Agree that we should add a note about this in the docs. Please let me know if you see any other problem with this behaviour.
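To make the behaviour easier to picture, here is a toy model of it. This is not Filebeat's code; `ToyRegistrar` and its methods are invented for illustration, and times are plain seconds. States expire in memory after clean_inactive, but the registry "file" is only rewritten when at least one state update arrives, or on shutdown.

```python
class ToyRegistrar:
    """Toy model only: the registry 'file' is rewritten on state updates
    and on shutdown, never on a timer."""

    def __init__(self, clean_inactive):
        self.clean_inactive = clean_inactive
        self.states = {}      # path -> time of last update (in memory)
        self.on_disk = set()  # paths currently listed in the registry file

    def tick(self, now, updated=()):
        for path in updated:
            self.states[path] = now
        # Drop states that have been inactive longer than clean_inactive.
        self.states = {p: t for p, t in self.states.items()
                       if now - t <= self.clean_inactive}
        if updated:  # no updates means no write, so stale entries stay on disk
            self.on_disk = set(self.states)

    def shutdown(self):
        # The registry is also written when the process stops.
        self.on_disk = set(self.states)


registrar = ToyRegistrar(clean_inactive=180)
registrar.tick(0, ["file1"])    # file1 is harvested and written
registrar.tick(200)             # expired in memory, still listed on disk
registrar.tick(210, ["file2"])  # the write for file2 drops file1 as well
```

In this model the first entry lingers until the second file triggers a write, which matches what was observed in the registry earlier in the thread.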
@Airn5475 I quickly scanned all the comments above and saw that you are worried that a state is removed before filebeat finished reading. This is prevented by filebeat. As long as a harvester is open for a file, the state will not be cleaned as this would lead to very unpredictable behaviour.