Filebeat does not see file updates after a while


#1

Hello,

I am facing a strange issue with Filebeat on Windows:

I use it to consume log files generated with log4net. These files are configured to rotate every day. Filebeat is only watching the current file.

Everything works well (new lines are forwarded to Logstash) until the early hours of the following day.

Every day between midnight and 2 or 3 a.m., Filebeat stops seeing file updates.

I do not think it is due to file rotation, because the first lines of the new day are correctly forwarded to Logstash.

But after a while (20 minutes to 2 hours), nothing is forwarded anymore.

And if I restart the Windows service, all skipped events are forwarded and everything works again until the following morning.

I am using Filebeat RC1, and in the logs I can see the last event published, followed by these lines repeated until the end:

2015-11-18T02:02:07+01:00 DBG  Start next scan
2015-11-18T02:02:07+01:00 DBG  scan path d:/logs/application.log
2015-11-18T02:02:07+01:00 DBG  Check file for harvesting: d:/logs/application.log
2015-11-18T02:02:07+01:00 DBG  Update existing file for harvesting: d:/logs/application.log
2015-11-18T02:02:07+01:00 DBG  Not harvesting, file didn't change: d:/logs/application.log

There is no error in the log file.

I know that I do not have much info for troubleshooting this issue, but if you have any idea about it or suggestions to collect more info, do not hesitate!

Some info about my Filebeat config:

filebeat:
  prospectors:
    -
      paths:
      - d:/logs/application.log
      encoding: iso-8859-1
      input_type: log
      ignore_older: 5m

  registry_file: d:/filebeat/filebeat.application.registry
  spool_size: 1
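
For clarity, here is the same config annotated with what each option does as I understand it (the comments are my own notes, not part of the original file):

```yaml
filebeat:
  prospectors:
    -
      paths:
        - d:/logs/application.log    # only the current (non-rotated) file is watched
      encoding: iso-8859-1           # log4net writes Latin-1 in my setup
      input_type: log
      ignore_older: 5m               # stop harvesting a file not modified for 5 minutes

  registry_file: d:/filebeat/filebeat.application.registry
  spool_size: 1                      # flush every single event to the output immediately
```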

(ruflin) #2

@ogauchard Thanks for reporting this. It seems like this could also be related to: ERR File reading error. Stopping harvester. Error: EOF

So in your case even though additional lines are added to application.log, filebeat tells you that the file didn't change, correct?

So far I assumed the issue could be related to a race condition with fast file rotation, but that does not seem to apply in your case.

Some questions that could help nail down the problem:

  • Are there any specific characters when it stops reading?
  • Is the rotate file deleted when the reading stops?
  • Are you using a shared drive under windows?
  • What is the total number of events sent to ES when the logging stops?
  • Do you see the above lines (Start next scan, ...) repeated multiple times, i.e. until you kill Filebeat?

#3

Thanks for answering so quickly!

Yes, exactly: additional lines are added to application.log, but Filebeat says the file didn't change.

In answer to your questions:

  • Nothing special. This is a standard message that Filebeat correctly handles at other times.
  • I cannot answer. I have to check the log4net source code. I think it is the same mechanism as log4j.
  • No, it is not a shared drive.
  • It depends. Sometimes about 40, sometimes 300.
  • Yes. These lines are repeated until I stop the Windows service.


#4

I just had a look at the log4net code. It seems to rename the current file (using a move operation) and then create a new current file.


(ruflin) #5

@ogauchard Thanks a lot for the details. That helps. I will let you know when I have more details.


(ruflin) #6

@ogauchard We found a potential issue and fixed it. It would be nice if you could try our most recent build here to see if it fixed your issue: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/


#7

@ruflin I've just installed the latest build and started the service. I'll give you feedback on Monday after a few days of processing. Thanks for your efficiency.


(ruflin) #8

Thanks.


#9

Well, it's half good news. I installed the latest build and restarted Filebeat last Friday morning. It ran well and detected all changes until this morning at about 1 a.m. So it is much better, but not totally fixed. As usual, I restarted the Windows service this morning and all events from 1 a.m. until now were then detected.

Here are the logs of the last "good" scan:

2015-11-23T00:42:37+01:00 DBG Start next scan
2015-11-23T00:42:37+01:00 DBG scan path d:/logs/application.log
2015-11-23T00:42:37+01:00 DBG Check file for harvesting: d:/logs/application.log
2015-11-23T00:42:37+01:00 DBG Update existing file for harvesting: d:/logs/application.log
2015-11-23T00:42:37+01:00 DBG Not harvesting, file didn't change: d:/logs/application.log
2015-11-23T00:42:38+01:00 DBG Flushing spooler because of timemout. Events flushed: 0
2015-11-23T00:42:43+01:00 DBG Flushing spooler because of timemout. Events flushed: 0
2015-11-23T00:42:46+01:00 DBG Flushing spooler because spooler full. Events flushed: 1
2015-11-23T00:42:46+01:00 DBG Flushing spooler because spooler full. Events flushed: 1
2015-11-23T00:42:46+01:00 DBG send event
2015-11-23T00:42:46+01:00 DBG Start Preprocessing
2015-11-23T00:42:46+01:00 DBG Flushing spooler because spooler full. Events flushed: 1
2015-11-23T00:42:46+01:00 DBG Publish: {
...
}
2015-11-23T00:42:46+01:00 DBG Forward preprocessed events
2015-11-23T00:42:46+01:00 DBG output worker: publish 1 events
2015-11-23T00:42:46+01:00 DBG Try to publish %!s(int=1) events to logstash with window size %!s(int=1024)
2015-11-23T00:42:46+01:00 DBG %!s(int=1) events out of %!s(int=1) events sent to logstash. Continue sending ...
2015-11-23T00:42:46+01:00 INFO Events sent: 1
2015-11-23T00:42:46+01:00 DBG send event
2015-11-23T00:42:46+01:00 DBG Start Preprocessing
2015-11-23T00:42:46+01:00 DBG Processing 1 events
2015-11-23T00:42:46+01:00 DBG Flushing spooler because spooler full. Events flushed: 1
2015-11-23T00:42:46+01:00 DBG Write registry file: d:\Applis\ELK\data\filebeat\filebeat.histovds.webservice.registry
2015-11-23T00:42:46+01:00 DBG Publish: {
...
}

...

2015-11-23T00:42:46+01:00 DBG Forward preprocessed events
2015-11-23T00:42:46+01:00 DBG output worker: publish 1 events
2015-11-23T00:42:46+01:00 DBG Try to publish %!s(int=1) events to logstash with window size %!s(int=1024)
2015-11-23T00:42:46+01:00 DBG Processing 1 events
2015-11-23T00:42:46+01:00 DBG Write registry file: d:\Applis\ELK\data\filebeat\filebeat.histovds.webservice.registry
2015-11-23T00:42:46+01:00 INFO Registry file updated. 1 states written.
2015-11-23T00:42:46+01:00 DBG %!s(int=1) events out of %!s(int=1) events sent to logstash. Continue sending ...
2015-11-23T00:42:46+01:00 INFO Events sent: 1
2015-11-23T00:42:46+01:00 DBG Processing 1 events
2015-11-23T00:42:46+01:00 DBG Write registry file: d:\Applis\ELK\data\filebeat\filebeat.histovds.webservice.registry
2015-11-23T00:42:46+01:00 INFO Registry file updated. 1 states written.

And afterwards the following sequence is repeated until I stop the service:

2015-11-23T00:42:47+01:00 DBG Start next scan
2015-11-23T00:42:47+01:00 DBG scan path d:/logs/application.log
2015-11-23T00:42:47+01:00 DBG Check file for harvesting: d:/logs/application.log
2015-11-23T00:42:47+01:00 DBG Update existing file for harvesting: d:/logs/application.log
2015-11-23T00:42:47+01:00 DBG Not harvesting, file didn't change: d:/logs/application.log
2015-11-23T00:42:53+01:00 DBG Flushing spooler because of timemout. Events flushed: 0

And there is also this line (which is expected):

2015-11-23T00:47:51+01:00 ERR File reading error. Stopping harvester. Error: Stop harvesting as file is older then ignore_older: d:/logs/application.log; Last change was: 5m5.011712s

Our servers are VMs on a private cloud. I have to check whether any maintenance operation occurred this morning that could have prevented Filebeat from running correctly.


(ruflin) #10

If I understand it right, the file which is not picked up again is the application.log file. It seems to happen after the first time the harvester closes the file because of ignore_older. Afterwards the file changes again, but filebeat still decides that it hasn't changed, correct?


#11

This is unfortunately not so simple. I have checked the logs and I can see that the harvester had already been closed on previous days due to ignore_older, yet subsequent changes were still detected.

And I cannot see any abnormal activity in the cloud cluster.


(ruflin) #12

Strange. The next time the harvester stops picking up the file, could you check the file stats (creation date, last modified)?


#13

OK

I will see tomorrow morning whether everything is fine. If it is not, I will give you that information.

It is very strange. Everything was fine on Saturday and Sunday, but the behaviour I encountered before the last build occurred again this morning.

Just for information, what did you fix in the last build?


(ruflin) #14

@ogauchard Which two builds are you comparing? rc2 and nightly from friday?


#15

@ruflin rc2 and the build from Thursday at 15:33


(ruflin) #16

@ogauchard For the moment there is probably a quite simple solution: increase ignore_older to 24h, or at least to something higher than the longest update interval of your files. Can you try that out?
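
To make that concrete, here is a sketch of the adjusted prospector (only ignore_older changes; the other values are taken from the config posted above):

```yaml
filebeat:
  prospectors:
    -
      paths:
        - d:/logs/application.log
      encoding: iso-8859-1
      input_type: log
      ignore_older: 24h   # longer than the daily rotation interval, so the
                          # current file is not dropped between updates

  registry_file: d:/filebeat/filebeat.application.registry
  spool_size: 1
```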


(ruflin) #17

@ogauchard There were two bugs we fixed. The description is in the changelog: https://github.com/elastic/filebeat/blob/master/CHANGELOG.md The full code changes can be seen here: https://github.com/elastic/filebeat/compare/1.0.0-rc2...1.0.0


#18

@ruflin Won't a large ignore_older value be a problem with file rotation?


(ruflin) #19

@ogauchard As long as you don't have a very high file rotation rate with thousands of files every 24 hours, it shouldn't be.


#20

@ruflin OK, I'll leave ignore_older at 5 minutes to see what happens tomorrow morning. If it does not work, I'll set it to 24h.