I'm trying to find an official answer to how Filebeat handles this hypothetical scenario, because I can't seem to find one in the documentation.
Presume we have an hourly log that Filebeat reads and sends to Logstash. Then Logstash goes down for a few hours, but Filebeat hasn't finished reading that file, so it waits until it reconnects to Logstash.
If ignore_older is set to a low value, say 1 hour, will Filebeat let go of the handler for the initial file after that hour, even though it hasn't finished sending the log to Logstash yet? In that case, because nothing would be writing to that file, is the rest of the data in that file dropped?
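For concreteness, the setup in question might look something like this minimal sketch of a Filebeat 1.x-style config (the paths and Logstash host are made up for illustration):

```yaml
filebeat:
  prospectors:
    # Hypothetical prospector for the hourly logs in question
    - paths:
        - /var/log/app/hourly-*.log
      input_type: log
      # Files not modified for 1h are ignored by the prospector
      ignore_older: 1h

output:
  logstash:
    hosts: ["localhost:5044"]
```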
That's a good question, and to answer it 100% correctly I would have to try it out. The short answer is that in the future we probably won't have this theoretical problem anymore because of the introduction of close_older (https://github.com/elastic/beats/pull/718), which makes it possible to set ignore_older to infinity by default.
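As a rough sketch of what that could look like once close_older is available (untested; the path is illustrative and option support depends on the Filebeat version):

```yaml
filebeat:
  prospectors:
    - paths:
        - /var/log/app/hourly-*.log
      # Keep tracking files effectively forever; only close the
      # handler after 1h of inactivity.
      ignore_older: 8760h   # one year, effectively "infinity"
      close_older: 1h
```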
In case you still need ignore_older set to an hour, I assume the following would happen. Please be aware that I haven't tested it:
Filebeat retries indefinitely to reach Logstash with all the lines it has already read from the file. The harvester is blocked until those lines are sent, so the lines read before Logstash went down would be sent as soon as Logstash is available again. I would expect the harvester to finish reading the file, as ignore_older is only checked at the prospector level.
TBH, I would have to check in detail whether there could be a potential race condition between the prospector and the harvester when setting the current offset. There were quite a few race conditions in the 1.0.* releases which should be fixed in the upcoming 1.1 release.
If at all possible, I recommend not getting into the above situation. Perhaps @steffens can add some more details here.
BTW: Would you still have this issue with the close_older feature?
Well, the specific issue is that we wouldn't want it to be closed by close_older. We need to ensure that all the lines of the log are sent, regardless of whether the file has changed in the last hour. If Logstash goes down for a few hours, the file shouldn't get dropped by Filebeat, because then the rest of the log could be lost.
close_older only closes the file handler but does not change the offset (unlike ignore_older). As soon as Logstash becomes available again, at the latest after scan_frequency, the file is picked up again at its previous position and sending resumes. So no lines are lost (which could be the case with ignore_older).
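To illustrate the difference as a sketch (untested; path and values are illustrative): with a config like the one below, the registry keeps the read offset after close_older fires, so on the next scan the file is re-opened at that offset rather than from the beginning.

```yaml
filebeat:
  prospectors:
    - paths:
        - /var/log/app/hourly-*.log
      close_older: 30m      # closes the handler; offset is kept in the registry
      ignore_older: 8760h   # effectively infinite: the file is never dropped
      scan_frequency: 10s   # how often the prospector re-checks the paths
  # Offsets survive restarts and re-opens via the registry file
  registry_file: /var/lib/filebeat/registry
```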
Oh, I forgot to mention a specific detail. The reason I'm talking about "losing data" is that we have an hourly log-delete process. If Filebeat holds onto the file, the delete process will skip it. However, if Logstash goes down for a few hours and Filebeat drops the file handler after 30 minutes, even though the file isn't done streaming, the log-delete process will delete the file because it isn't open anymore.
So, the question we're trying to answer is: if Logstash goes down while Filebeat is mid-log, will Filebeat still honor the ignore_older setting and let go of the log, so that when the log-delete process runs, any leftover data is lost?
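If it helps, here is a hedged sketch based on the discussion above: keeping close_older longer than any Logstash outage you expect should keep the handler open, so a delete process that skips open files would leave the file alone (assuming it really does check for open handles; values are illustrative):

```yaml
filebeat:
  prospectors:
    - paths:
        - /var/log/app/hourly-*.log
      # Hold the file handler long enough to ride out a multi-hour
      # Logstash outage; the hourly delete job then skips the open file.
      close_older: 6h
      ignore_older: 8760h   # effectively infinite, so the file is never dropped
```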