ERR File reading error. Stopping harvester. Error: EOF

I am facing an issue with filebeat halting completely after encountering a file reading error (ERR File reading error. Stopping harvester. Error: EOF). The use case is as follows:

We have some logs which are generated and rotated very rapidly. The number of rotated log files is capped at around 32. The log file names follow a hex string sequence (for example: 10001.log, 10002.log, ... 1000A.log, etc.). When I start filebeat, the logs are processed fine up to some point, then filebeat starts to throw a file reading error, after which it halts completely and fails to parse or open harvesters for newly generated log files. Below is the error I am getting (with debug enabled):
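In case it helps to reproduce, below is a rough sketch (written in Go) of how our rotation pattern could be simulated. The directory, line rate, and rotation timing are only illustrative, not our actual generator:

// Illustrative repro sketch: writes hex-named *.ACT files into a test
// directory, appends lines for roughly 30s per file, then rotates to the
// next name and keeps at most 32 files by deleting the oldest.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	dir := "/tmp/actlogs" // illustrative path, not the real /var/log
	if err := os.MkdirAll(dir, 0755); err != nil {
		panic(err)
	}

	const maxFiles = 32
	seq := 0x10001 // hex sequence, e.g. 10001.ACT, 10002.ACT, ... 1000A.ACT
	var rotated []string

	for {
		name := filepath.Join(dir, fmt.Sprintf("%X.ACT", seq))
		f, err := os.Create(name)
		if err != nil {
			panic(err)
		}
		// Write bursts of lines for about 30 seconds, then rotate.
		for i := 0; i < 30; i++ {
			for j := 0; j < 1000; j++ {
				fmt.Fprintf(f, "%s line %d-%d in %s\n",
					time.Now().Format(time.RFC3339Nano), i, j, name)
			}
			time.Sleep(time.Second)
		}
		f.Close()

		rotated = append(rotated, name)
		if len(rotated) > maxFiles {
			os.Remove(rotated[0]) // drop the oldest rotated file
			rotated = rotated[1:]
		}
		seq++
	}
}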

2015/11/17 07:15:34.125071 preprocess.go:91: DBG preprocessor forward
2015/11/17 07:15:34.125144 output.go:103: DBG output worker: publish 178 events
2015/11/17 07:15:36.269232 filebeat.go:133: DBG Events sent: 178
2015/11/17 07:15:36.269308 registrar.go:99: DBG Registrar: processing 178 events
2015/11/17 07:15:36.269415 registrar.go:135: DBG Write registry file: /var/lib/filebeat/registry
2015/11/17 07:18:36.138599 log.go:251: DBG Stopping harvesting of file as older then ignore_old: %!(EXTRA string=/var/log/1000846.ACT, string=Last change was: , time.Duration=10m5.007949554s)
2015/11/17 07:18:36.138733 log.go:126: ERR File reading error. Stopping harvester. Error: EOF
2015/11/17 07:18:36.139992 log.go:251: DBG Stopping harvesting of file as older then ignore_old: %!(EXTRA string=/var/log/100084B.ACT, string=Last change was: , time.Duration=10m5.007038985s)
2015/11/17 07:18:36.140034 log.go:126: ERR File reading error. Stopping harvester. Error: EOF
2015/11/17 07:18:36.140526 log.go:251: DBG Stopping harvesting of file as older then ignore_old: %!(EXTRA string=/var/log/100084D.ACT, string=Last change was: , time.Duration=10m5.006297913s)
2015/11/17 07:18:36.140632 log.go:126: ERR File reading error. Stopping harvester. Error: EOF
2015/11/17 07:18:36.140649 log.go:251: DBG Stopping harvesting of file as older then ignore_old: %!(EXTRA string=/var/log/100084C.ACT, string=Last change was: , time.Duration=10m5.006218238s)
2015/11/17 07:18:36.140948 log.go:126: ERR File reading error. Stopping harvester. Error: EOF
2015/11/17 07:18:36.144508 log.go:251: DBG Stopping harvesting of file as older then ignore_old: %!(EXTRA string=/var/log/100084E.ACT, string=Last change was: , time.Duration=10m5.006633198s)
2015/11/17 07:18:36.144603 log.go:126: ERR File reading error. Stopping harvester. Error: EOF

The config I am using is as below:

filebeat:
  prospectors:
    -
      paths:
        - /var/log/*.ACT
      encoding: plain # tried both utf-8 and plain
      ignore_older: 10m
      scan_frequency: 0s
  spool_size: 512 # Tried with 1024 and default as well

output:
  logstash:
    enabled: true
    hosts: ["localhost:5044"]

I initially tried filebeat-rc1 and noticed this issue. I also tried the latest nightly build and the issue persists. I looked through the forum for a similar issue but couldn't find one. Please help me get past this issue. We really want to go with filebeat for log shipping in our solution, but we are stuck at the moment because of it. Your help will be highly appreciated. Thanks.

Hi @vinod8427

Thanks for reporting this. A few things here:

  • Setting scan_frequency to 0s is a bad idea. We recommend at least 5s. We had a similar issue in our system tests when we set the value very low, which led to some race conditions. We are trying to fix this, but since we don't want the harvesters to be able to block each other, it is hard to prevent. What is the reason you set scan_frequency to 0s?
  • The EOF error you see is not the issue here. Even though it is printed as ERR, it only shows that some files were not changed for 10 minutes, so the file handler / harvester is closed, which is good. The error message is improved in the next nightly build (not completely fixed).

Please try to set scan_frequency to at least 5 or 10 seconds and see if the issue still exists.
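For reference, here is a minimal sketch of the adjusted prospector section, keeping your other settings and only raising scan_frequency (10s is just an example value):

filebeat:
  prospectors:
    -
      paths:
        - /var/log/*.ACT
      encoding: plain
      ignore_older: 10m
      scan_frequency: 10s # at least 5s; avoids the tight rescan loop of 0s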

I assume you are on Linux?

Hi ruflin,

Thanks for the speedy response. I had in fact tried, across multiple test runs, setting scan_frequency anywhere from 10s up to 2 min, but in vain. I ended up getting the same error message after some time. The reason I set scan_frequency to 0s was a suggestion from a member of this forum for a somewhat related issue. I understand it's a bad idea to use 0s, and I wouldn't mind setting it to anything in the range of 10-20s as long as it works for me.
I did, however, notice one thing about this EOF error: the files against which the error was reported were actually no longer available, contrary to what the timeout message printed in the debug log suggests. And BTW, I am on a **Linux** box.
Please let me know if you require any further details, and again, I appreciate your help in getting this issue resolved. Thanks.

@vinod8427 How rapidly are your log files generated and rotated?

By "files were not available" you mean that these files were already deleted?

Yes, the files were already deleted by then.
Depending on the load, each file is generated and rotated in maybe less than 30s.

Ok, that is definitely an interesting use case. That the files are deleted should not matter, and it is actually good to see that filebeat closes them properly after ignore_older.

  • Is the above the end of your debug output?
  • Can you try the most recent nightly build, as it should have better output?
  • Do you see any pattern when it stops reading?

Based on your output above, I assume filebeat has already processed a few hundred log files?

Please see my responses inline:

Is the above the end of your debug output?
[Vinod]: Yes.
Can you try to the most recent nightly build as it should have some better outputs?
[Vinod]: I tried the latest nightly build and the output was the same
Do you see any pattern when it stops reading?
[Vinod]: The following is the preceding message that appears right before the logs posted above:
2015/11/17 07:15:34.125071 preprocess.go:91: DBG preprocessor forward
2015/11/17 07:15:34.125144 output.go:103: DBG output worker: publish 178 events
2015/11/17 07:15:36.269232 filebeat.go:133: DBG Events sent: 178
2015/11/17 07:15:36.269308 registrar.go:99: DBG Registrar: processing 178 events
2015/11/17 07:15:36.269415 registrar.go:135: DBG Write registry file: /var/lib/filebeat/registry
Based on your output above, I assume filebeat has already processed a few hundred log files?
[Vinod]: filebeat would have processed around 20 out of 32 log files. Each log file is around 20 MB.

How long does it take until filebeat gets stuck?
Would you be available for a screen sharing session so we could have a look at the issue together?

It takes around 5-7 minutes before it starts throwing this error and halts completely.
I am currently available for a screen sharing session. Please let me know how we can get this session started.
Also, in the latest nightly build the debug option -d "*" doesn't always log the messages. So far I have got it logging only once; it stops logging after filebeat startup.

We found the potential problem and fixed it. This will be part of the next release and is already part of the nightly builds: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

@vinod8427 For the growing registrar file, please follow this issue: https://github.com/elastic/filebeat/issues/269