Filebeat should focus resources on the oldest file

When any part of the ELK stack bottlenecks, files begin to build up on the filebeat side. When this happens, filebeat holds the files open even after they are deleted by log rotation. The problem I see here is that as files rotate, filebeat does not discriminate and simply appears to divide its resources across all open files.

What happens is a cascading, or exponential, effect. The farther behind filebeat gets, the more files are held open, and the less focus any single file receives. This means all open files are worked more slowly, no files are released, and disk space is chewed up by old open files. If the oldest file were given dedicated resources, it could be closed out and that disk space released back to the host, allowing it to receive more logs.

For example, if there is 1 open file, all resources are dedicated to sending that file to logstash. Once logs rotate, 2 files are now held open by filebeat, and only 50% of filebeat's "resources" are dedicated to the original file, which means it will now take longer for that file to be processed. Then the file rotates yet again, and the original file is still open and now has even fewer "resources" working it. That means disk space is taken up by 3 files all being worked in parallel by filebeat, and hard drive usage is at its maximum. If, instead, filebeat held the files open yet dedicated most (or all) resources to the oldest file, that file could be worked with priority and closed out much sooner, freeing up the disk space it consumes and returning it to the system.

The more filebeat falls behind (not necessarily filebeat's fault), the exponentially further it falls behind. This may not be a problem for 2, 3 or 4 files, but when you are talking about 30 or 40 files held open it becomes a real problem. It could take an hour or more before an open file lock is released, since each file gets so little focus in the overall scheme.

I hope I have explained my concerns clearly, and I look forward to others' thoughts or solutions.


There is currently no scheduling or prioritisation inside filebeat, as you correctly identified above. Two potential indirect solutions to the above could be using harvester_limit and close_timeout. With harvester_limit, new harvesters are only started once old ones are closed. And close_timeout will make sure old files are closed, but it could lead to data loss in case a file is removed before it is picked up again.

With harvester_limit filebeat will still focus on the files which were opened first (probably the oldest ones over time).
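As a minimal sketch, this is roughly what the two options could look like in a filebeat 5.x prospector config; the path and values below are placeholder assumptions, not recommendations:

```yaml
filebeat.prospectors:
  - input_type: log
    # hypothetical path, adjust to your rotated log files
    paths:
      - /var/log/myapp/*.log*
    # start at most 2 harvesters for this prospector; additional files are
    # only picked up after one of the open harvesters has been closed
    harvester_limit: 2
    # close the file handler after 30 minutes of harvesting, even if the
    # file is not fully read yet (it will be picked up again later, with a
    # risk of data loss if the file is deleted in the meantime)
    close_timeout: 30m
```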

I'm curious to hear whether you hit some limits with 30-40 files open? Because I'm aware of installations with a much larger number of open files without issues. When you refer to 50% of resources, which part are you referring to? Memory, CPU, disk usage?

Hi, no issues with open files directly. I think in the end it was 118 open files before I had to stop due to disk space.

As for "50% resources" I dont necessarily mean any particular resource but ELK stack as a whole. In my case the issue is due to a bottleneck on storage i/o. the idea would be to allow calls to "queue" in the files, though in our scenario disk space fills up due to too many open files not freeing up the space.

In this scenario we would be able to stay "alive" longer if filebeat processed the oldest file first, thus releasing that file sooner and freeing that disk space back up, as opposed to processing all files at the same time and thus not releasing any, at least not in a timely manner.

What I saw was the files slowly closing out: down to 117, then 116, and so on, very slowly, until about 30 files were open, at which point "resources" were more focused and the remaining files closed out more quickly.

The main problem being: the more open files filebeat processes, the slower each file is processed, causing even more files to be retained open, exponentially.

Obviously the root of the problem is elastic not being able to process quickly enough. However, if the idea of filebeat is to queue logs in the files themselves, perhaps a solution would be to process a single file at a time per prospector (the oldest would be preferable in my case).

In other words, if I am prospecting log.log, which then rotates to log.log.1 and is finally deleted, filebeat could focus on the oldest file of that rotation (in our 3-open-file scenario it would be log.log.1 (deleted)). This way, even though we would continue to fall behind, we would be freeing up disk space in a quicker and cleaner fashion.

I am not sure if this is a possibility, as I have not looked into the source code; I am simply offering up my observation of the behavior and my thoughts on an improved scenario. I hope this makes sense.

Did you try the suggested config options above, especially harvester_limit? I think in most cases it will do what you expect.

One thing you have to be aware of is that every time you get way behind on reading logs, there is a chance of losing logs if filebeat hasn't opened the file yet. Obviously there is a tradeoff between losing log lines and the sanity of the edge node.

How many edge nodes with filebeat do you have? As you don't seem to hit a limit on the filebeat side, I would recommend scaling elasticsearch horizontally. How many nodes do you currently have? How many events do you get on the filebeat side?

I will look into some of the other settings, but we already have a cron solution to free up resources at the expense of losing logs. We cannot scale horizontally, as it is our storage solution that is having the issue, and we are currently not able to expand it.

I was merely looking to bring up the issue, though it appears the current solution on the filebeat side would be to set harvester_limit, close_older, ignore_older, or some combination of those.
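As a rough, untested sketch, such a combination might look something like this in a 5.x prospector config; note that close_older is called close_inactive in filebeat 5.x, and the path and durations here are placeholder assumptions:

```yaml
filebeat.prospectors:
  - input_type: log
    # hypothetical path, adjust to your rotated log files
    paths:
      - /var/log/myapp/*.log*
    # allow only one open harvester at a time for this prospector
    harvester_limit: 1
    # close the file handler once the file has not been harvested for this long
    # (named close_inactive in 5.x, close_older in older releases)
    close_inactive: 5m
    # skip files whose last modification is older than this
    ignore_older: 24h
```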

Anyway, thank you for your responses; they have given me some ideas moving forward and a better understanding of filebeat.

In case the output is blocked, the most important option is close_timeout. Since filebeat 5.3, it will also close the file handler if the output is blocked.
