Filebeat does not poll files in order

I was testing a new Beat I created (Rebeat) with a large queue of log files, where Filebeat would pick up the logs and send them over to my Beat. During debugging I confirmed that there is no particular (human-friendly) order in which Filebeat polls the files and sends the log entries.

I wanted to open a GitHub issue about this, but GitHub notes that I should discuss it here first.

I recommend that Filebeat organize the prospectors so that files are polled and sent in a consistent, predictable manner. For example (a rough sketch of this ordering follows the list):

  • 1000 files as input (as an example)
  • Sorted by file datetime, ascending, truncated to the hour
  • Then sorted by full path name (directory + filename)
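
To make the proposal concrete, here is a minimal sketch of that ordering in Go (this is not Filebeat code, and the queue directory is made up):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"time"
)

func main() {
	paths, _ := filepath.Glob("/var/log/queue/*.log") // hypothetical queue directory

	type entry struct {
		path string
		hour time.Time
	}
	var entries []entry
	for _, p := range paths {
		info, err := os.Stat(p)
		if err != nil {
			continue // file may have been rotated away between Glob and Stat
		}
		entries = append(entries, entry{path: p, hour: info.ModTime().Truncate(time.Hour)})
	}

	// Sort by file datetime ascending (truncated to the hour), then by full path name.
	sort.Slice(entries, func(i, j int) bool {
		if !entries[i].hour.Equal(entries[j].hour) {
			return entries[i].hour.Before(entries[j].hour)
		}
		return entries[i].path < entries[j].path
	})

	for _, e := range entries {
		fmt.Println(e.hour.Format(time.RFC3339), e.path)
	}
}
```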

I have an additional feature request, but if I would need to open two separate GitHub issues, I might as well discuss it in this thread as well. It would be really useful if Filebeat had the option to rotate based on a timestamp (YYYY-dd-mm or epoch) rather than a rolling sequence-number suffix.

thanks

Filebeat is a multithreaded application, and it will likely be reading from multiple files concurrently. As such, there isn't a practical way to read the files in order. Why do you need this behavior?

That seems like a reasonable enhancement. I think I would personally rather see Beats integrate better with logrotate so that you can manage rotation however you want. Beats would just need to watch for a SIGHUP from logrotate, close the current log file, and create a new one.
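
A minimal sketch of that reopen-on-SIGHUP pattern, assuming a process that writes to a single log file (the path here is made up):

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// openLog opens (or creates) the log file in append mode.
func openLog(path string) *os.File {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatalf("open %s: %v", path, err)
	}
	return f
}

func main() {
	const path = "/var/log/mybeat/mybeat.log" // illustrative path only

	f := openLog(path)

	// logrotate renames the current file and then sends SIGHUP; on receiving it
	// we close the old handle and create a fresh file at the original path.
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)

	for range hup {
		f.Close()
		f = openLog(path)
		log.Printf("reopened %s after SIGHUP", path)
	}
}
```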

An example use case is a host collecting logs on a network that, for whatever reason, is unable to transfer logs via Filebeat for a few days and then resumes. If the goal is just to get those logs into Elasticsearch and make them searchable at a later date, then order is not a problem. But if you're sending to Logstash, Graylog, or Kafka, and those systems are presenting data on a map, running datetime-sensitive analytics, etc., then out-of-order reception becomes a problem.

Another example use case is a cluster of syslog servers receiving logs from a few thousand hosts spread out across a WAN, then saving them to a file queue for Filebeat to pick up. You're going to come across enterprise customers who are just now putting in the Elastic Stack and aren't going to be able to install Filebeat on every host, but they may be able to install it on a few hosts that are already aggregating logs.

Speaking of concurrently reading multiple files, that reminds me of another problem I asked about in another thread. It looks like Filebeat is limited in how many files it can have open. Originally I thought it was just a ulimit problem, so I raised the ulimit, but I then observed that Filebeat still threw errors and eventually died.

I propose that Filebeat not only sort the prospectors, but also limit how many are open at once to maintain stability. If there were 1000 files in a queue, Filebeat would sort the 1000 files, open the first 10 in order, finish those out, then open the next 10, and so on. That way it can gracefully handle situations like this.
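
A rough sketch of that batching idea in Go, assuming the sorted list from the earlier sketch (the file names here are placeholders):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

// harvest reads one file line by line; in a real Beat each line would be
// forwarded to the output instead of discarded.
func harvest(path string) {
	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		_ = scanner.Text()
	}
}

func main() {
	// Placeholder for the sorted list produced in the earlier sketch.
	sortedPaths := []string{"a.log", "b.log", "c.log"}

	const batch = 10
	for start := 0; start < len(sortedPaths); start += batch {
		end := start + batch
		if end > len(sortedPaths) {
			end = len(sortedPaths)
		}

		var wg sync.WaitGroup
		for _, p := range sortedPaths[start:end] {
			wg.Add(1)
			go func(path string) {
				defer wg.Done()
				harvest(path)
			}(path)
		}
		wg.Wait() // finish this batch before opening the next ten files
	}
}
```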

I think the timestamp from the log message should be used for any kind of analysis where the date is important. There is no guarantee that Filebeat will read a line at the same time it is produced, so parsing this information out of the log line is necessary.
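
A minimal sketch of pulling the event time out of the log line itself, assuming a simple space-separated format (the line and layout are made up):

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

func main() {
	line := "2017-10-20 16:00:00 host1 app[123]: request received"

	// Assume (purely for this sketch) that the first two fields form the timestamp.
	fields := strings.SplitN(line, " ", 3)
	if len(fields) < 2 {
		fmt.Println("unexpected line format")
		return
	}
	ts, err := time.Parse("2006-01-02 15:04:05", fields[0]+" "+fields[1])
	if err != nil {
		fmt.Println("could not parse timestamp:", err)
		return
	}
	fmt.Println("event time:", ts.UTC())
}
```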

The file descriptor limit imposed by ulimit is usually the first issue people experience when reading from a large number of files. There is a harvester_limit option that can be used to limit the number of concurrent harvesters (file readers). Do you have a link to the thread? Was a bug opened for the issue?

I take that back. I looked at the other thread where I noted an issue with Filebeat (the thread wasn't specifically about Filebeat, but about another Beat I was writing), and this was the relevant snippet:

"Maybe this helps - in my Filebeat debug output, it notes All prospectors are initialized and running with 2043 states to persist. The failure occurs at file 1017. My ulimit was 1024. I went ahead and set this to 8192 and tried the test again - file open errors are gone, but the result is the same. After file 1020, I get:"

Having solved my issue with the Beat I was working on, I looked at the debug output again, and the errors were all related to that Beat rather than Filebeat. So never mind. :flushed:

I'll give the harvester_limit a try.

Right, I agree about timestamps; I would never assume logs arrive in any precise order. But the situation I'm trying to avoid is one where a queue of three days of logs is being harvested and the data received bounces between different days and hours. I know this is ultimately a cosmetic issue, but when presenting this stream of logs to a user, whether via Kibana or Graylog, it looks wrong if a batch of events from 2017-10-20 16:00:00 shows up, then 2017-10-19 23:30:00, then 2017-10-21 03:10:00, and so on.

Another situation, which is less cosmetic, is a best-effort analysis of a stream of events where one specific event needs to be flagged first and a related following event is flagged afterwards (e.g. a request to a specific website, followed by a response). These two events would be in the same logs, logged in order, but possibly split between two different files due to log rotation.

Another order-dependent situation is when timestamps are used as a trigger for a batch job. For example, some kind of report is generated for 2017-10-22 once events start coming in for 2017-10-23. If there is a connection problem on the sending host for a few days and it then recovers, and the harvesters send three days of logs in an unordered manner, ES gets a batch of events for the third (most recent) day, which prematurely triggers the batch jobs for the previous two days.

Another situation has to do with expiry. Let's say Filebeat is set to harvest only the last 48 hours of logs, and the connection has very limited bandwidth. There is a spike in activity, causing a backlog that takes a few days to clear. As the log files get older, they start to fall outside the harvest age limit. Rather than pseudorandom reception of events, we end up with pseudorandom gaps in the data.

:chipmunk: