Best configuration to avoid message loss due to inode reuse?

(Graham Allan) #1

I am using Filebeat to read a set of files logged using the Java Logback library. Files are named with the format "prefix-%d{yyyy-MM-dd}.%i.json", where %i is an index that is incremented when Logback rolls over. Logback is configured to remove files based on total size and age. With this naming strategy, and Filebeat searching for any file matching "prefix-*", there is no need to rename a file when Logback rolls over, so once a file is created it keeps the same name until it's removed. That seemed to be all well and good.
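For reference, the Filebeat side of this is just a glob over the prefix (a minimal sketch in Filebeat 5.x syntax; the path is illustrative, not our real config):

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      # matches prefix-2017-03-01.0.json, prefix-2017-03-01.1.json, etc.
      - /var/log/app/prefix-*.json
```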

However, we are observing inode reuse on the filesystem (we're using ext4), and I am not sure how to configure the clean_* and close_* options to handle this scenario. It seems that, with the way we have Filebeat configured, the deletion of file X and the reuse of its inode number by file Y is interpreted as a rename from X to Y. In our scenario, renames and moves never happen; only creates and deletes.

I believe this is also compounded by issues in our pipeline that caused Logstash to stop processing. It looks like when this happens, Filebeat is unable to ship any logs, and harvesters are not created or progressed, but meanwhile the application is still logging to new files and deleting old ones, potentially reusing inodes. I've been able to reproduce this on the first attempt by stopping Filebeat and then triggering enough logging in the application to cause a lot of rollover, as a proxy for what happens when Logstash is unavailable. Given the reproducibility, it doesn't seem like a "one in a million" kind of thing. Granted, we want Logstash to remain up, but if we do get downtime, we'd like to avoid data loss caused by inode reuse.

We're using the following settings (I think these are all that's relevant, let me know if more is needed):
close_inactive: 5m
close_renamed: false
close_removed: true
close_eof: false
clean_inactive: 0
clean_removed: true
close_timeout: 0

We're running Filebeat 5.2.2. These are all just the defaults we've taken from the Filebeat Puppet module we're using, and I believe they match Filebeat's own defaults as well.

In reading the config, I expected clean_removed to be the answer, but it was already enabled when this problem occurred. Then clean_inactive looked like the right setting, but I'm a bit worried by what "inactivity" means. I'm happy to remove state for files once Filebeat has shipped them to Logstash, but I don't want to clean files which are inactive but not yet processed by Filebeat. From the docs, I believe I have to set ignore_older, which goes by modification timestamp, in conjunction with clean_inactive. I'm worried that a direct comparison of current time vs. modification timestamp is disconnected from what Filebeat has actually processed. To prevent inode reuse, I'd have to set ignore_older short enough that Filebeat would begin ignoring files not because they're old, but because Filebeat can't ship them downstream, or was shut down for whatever reason. Once Logback deletes files, that data is gone forever, and that's the trade-off we're choosing. Ideally, Filebeat could be configured to just follow Logback, without another set of semantics to understand.

What is the best way to configure Filebeat to recognise what would look to it like a rename (different files using the same inode) as a new file, and never attempt to harvest from an inode using a filename that no longer exists?

(ruflin) #2

I think what you are looking for is clean_inactive. In case the LS output is blocked and the complete file has not been sent yet, Filebeat will keep the file open and finish sending it even if Logback has deleted it (the file is then not actually deleted, only unlinked). So all your data will be sent. Also, during this time the inode cannot be reused, as it is still taken by an open file.
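This open-handle behavior is plain POSIX semantics rather than anything Filebeat-specific; a small Python sketch (the filename is made up) shows that a reader holding an open descriptor can still drain a file after the producer has deleted it:

```python
import os

# A writer creates a "log file" and a reader opens it,
# standing in for Logback and a Filebeat harvester.
with open("app.json", "w") as w:
    w.write('{"msg": "hello"}\n')

reader = open("app.json", "r")

# The producer deletes the file (like Logback's cleanup) while the
# reader still holds it open: the name is gone from the directory,
# but the inode and its data stay allocated until the last open
# descriptor is closed.
os.remove("app.json")
assert not os.path.exists("app.json")

# The reader can still finish draining the contents.
print(reader.read())
reader.close()
```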

With the above in mind, it can still happen that a new file is generated with the same inode. In most logging cases this is not a problem, because the size of the new file is smaller than the old file, so Filebeat will assume it is a new file and start reading from scratch.

So your way of reproducing it is not representative of how Filebeat works, as Filebeat was stopped. Make sure to keep Filebeat running during your tests so it can do its job.

(Graham Allan) #3

Hey, thank you so much for responding. Really appreciate it. I have some follow up questions that hopefully make sense.

The clean_inactive option does sound like it addresses my issues; it hadn't occurred to me that of course a running Filebeat process can keep a file handle even though the application that logs has removed it.

My issue with clean_inactive is that it forces me to pick a setting for ignore_older, which I don't currently have set and don't really want. It's possible (likely, even) that I've misunderstood something, but I can't think of a reasonable setting for ignore_older. The Logback config can be paraphrased as: "Delete logs after they're 4 days old or reach a total of 1G, whichever comes first. Roll over daily, or on reaching 100M, whichever comes first." With that config, some of our lightest producers only roll over and delete based on date, but some of our heaviest producers roll over based on size, typically every 30 minutes or so, which means a log file is created and deleted within about 5 hours. That timeframe is also dynamic, based on volume, so any static constant is always going to be a best-effort guess.
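For the record, that policy corresponds roughly to a Logback appender like the following (a sketch, not our exact config; the path and encoder class are illustrative, and it assumes a Logback version with SizeAndTimeBasedRollingPolicy):

```xml
<appender name="JSON" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
    <!-- %d rolls daily; %i increments on size-based rollover within a day,
         so a file keeps one name from creation until deletion -->
    <fileNamePattern>/var/log/app/prefix-%d{yyyy-MM-dd}.%i.json</fileNamePattern>
    <maxFileSize>100MB</maxFileSize>   <!-- roll over on reaching 100M -->
    <maxHistory>4</maxHistory>         <!-- delete after 4 days -->
    <totalSizeCap>1GB</totalSizeCap>   <!-- or when the total reaches 1G -->
  </rollingPolicy>
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
```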

Note I'm also trying to accommodate some unpredictable outage where Filebeat has stopped for an extended period, so I may be confusing myself by trying to configure Filebeat to handle it. Perhaps instead I should look at either tweaking Logback's rollover policy, or trying to detect extended periods of Filebeat downtime and recover externally. Ironically, for all the use cases that tracking inodes enables, it's almost like this use case would be better met if Filebeat completely ignored inodes and went only by file path (not that I'm suggesting doing that, just an observation).

So given the desire to use clean_inactive, and no other reason to use ignore_older, what would be an appropriate setting for ignore_older?

(ruflin) #4

I think one of the problems is that you are looking for one ignore_older period, but you can set a different one for each environment / prospector, as you seem to know quite well how each behaves. ignore_older is required for clean_inactive, as otherwise complete logs would be sent again.
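To illustrate (a sketch with made-up paths and durations; note the docs require clean_inactive to be greater than ignore_older plus scan_frequency):

```yaml
filebeat.prospectors:
  # light producer: rolls daily, files live ~4 days
  - input_type: log
    paths:
      - /var/log/light-app/prefix-*.json
    ignore_older: 72h
    clean_inactive: 78h
  # heavy producer: files are created and deleted within ~5 hours
  - input_type: log
    paths:
      - /var/log/heavy-app/prefix-*.json
    ignore_older: 3h
    clean_inactive: 4h
```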

In general I recommend figuring out why LS is blocking and why Filebeat cannot send events for a long time. Filebeat keeps working when there is no connection (even for longer times), but it is not the "recommended" setup :slight_smile:

About the path: I agree that in some cases having the path as the identifier could be useful, especially in cases where the name never changes. For this, even a different prospector type could be used; it would actually make things easier, as following files across rotation can become tricky.

(Graham Allan) #5

Thanks for all your help.

I've opted for a single configuration across our applications of:

clean_inactive = min(time from creation to deletion) - 1h. The heaviest producer creates and deletes files roughly every 5 hours, so this is set at 4h for now.
ignore_older = clean_inactive + 1 minute

I'll give that a while to see how that compares in terms of inode reuse.

You're right that I was looking for a single, one-size-fits-all configuration: not because I thought it wasn't possible to configure things differently, but because I feel a single policy across our nodes will be easier for us humans to understand.

Thanks for your help, much appreciated.

(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.