I am facing an issue with filebeat halting completely after encountering a file reading error (ERR File reading error. Stopping harvester. Error: EOF). The use case is as follows:
We have some logs which are generated and rotated very rapidly; the number of rotated log files is restricted to around 32. The names of the generated log files follow a hex string sequence (e.g. 10001.log, 10002.log, ... 1000A.log etc). When I start filebeat, the logs are initially processed fine, but at some point it starts to throw the file reading error above, after which filebeat halts completely, failing to parse or open harvesters for newly generated log files. My configuration (with debug enabled) is below:
filebeat:
  prospectors:
    - paths:
        - /var/log/*.ACT
      encoding: plain # tried both utf-8 and plain
      ignore_older: 10m
      scan_frequency: 0s
  spool_size: 512 # Tried with 1024 and default as well
I initially tried filebeat-rc1 and noticed this issue. I also tried the latest nightly build and the issue still persists. I looked through the forum to see if a similar issue had been posted, but couldn't find any. Please help me get through this issue. We really want to go with filebeat for log shipping in our solution, but we are stuck at the moment because of this issue. Your help will be highly appreciated. Thanks
Setting scan_frequency to 0s is a bad idea. We recommend at least 5s. We had a similar issue in our system tests when we set the value very low, which leads to some race conditions. We are trying to fix this, but since we don't want harvesters to be able to block each other, it is hard to prevent. What is the reason you set scan_frequency to 0s?
The EOF error you see is not the issue here. Even though it prints ERR, it only shows that some files were not changed for 10 minutes, so the file handler / harvester is closed, which is good. The error message is improved in the next nightly build (not completely fixed).
Please try to set scan_frequency to at least 5 or 10 seconds and see if the issue still exists.
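For reference, a prospector configuration along those lines might look like this (a minimal sketch reusing the filebeat 1.x options from your config above; the 10s value is just a suggested starting point, not a tested recommendation):

filebeat:
  prospectors:
    - paths:
        - /var/log/*.ACT
      encoding: plain
      ignore_older: 10m   # files unchanged for 10m get their harvester closed (the EOF message above)
      scan_frequency: 10s # raised from 0s to avoid the race conditions mentioned above
  spool_size: 512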
Thanks for the speedy response. I had in fact tried, on multiple test runs, setting scan_frequency to values from 10s up to 2 min, but in vain. I ended up getting the same error message after some time. The reason I set scan_frequency to 0s was a suggestion given by a member of this forum for a somewhat related issue. I understand it's a bad idea to use 0s, and I wouldn't really mind setting it to anything in the range of 10 - 20s as long as it works for me.
I did however notice one thing about this EOF error: the files against which the error was reported were actually no longer available, as opposed to the timeout message printed in the debug log. And BTW, I am on a **Linux** box.
Please let me know if you require any further details, and again, I appreciate your help in getting this issue resolved. Thanks.
Ok, that is definitely an interesting use case. That the files are deleted should not matter, and it is actually good to see that filebeat closes them properly after ignore_older.
Is the above the end of your debug output?
Can you try the most recent nightly build, as it should have some better output?
Do you see any pattern when it stops reading?
Based on your output above, I assume filebeat has already processed a few hundred log files?
Is the above the end of your debug output?
[Vinod]: Yes.
Can you try the most recent nightly build, as it should have some better output?
[Vinod]: I tried the latest nightly build and the output was the same.
Do you see any pattern when it stops reading?
[Vinod]: The following is the preceding message, which appears right before the logs posted above:
2015/11/17 07:15:34.125071 preprocess.go:91: DBG preprocessor forward
2015/11/17 07:15:34.125144 output.go:103: DBG output worker: publish 178 events
2015/11/17 07:15:36.269232 filebeat.go:133: DBG Events sent: 178
2015/11/17 07:15:36.269308 registrar.go:99: DBG Registrar: processing 178 events
2015/11/17 07:15:36.269415 registrar.go:135: DBG Write registry file: /var/lib/filebeat/registry
Based on your output above, I assume filebeat has already processed a few hundred log files?
[Vinod]: filebeat would have processed around 20 out of the 32 log files. Each log file is around 20 MB.
It takes around 5 - 7 mins before it starts throwing this error and halts completely.
I am currently available for a screen sharing session. Please let me know how we can get this session started.
Also, in the latest nightly build, the debug option -d "*" doesn't always log the messages. So far I have got it logging only once; it stops logging after filebeat startup.