Rotated files are named like file.log.1, file.log.2, ....
Now, in most cases it works fine. But once in a while filebeat misses a whole file. A complete one: not even a single event from that rotated log file gets processed.
Can you post an ls output and mark the files that were not read? Perhaps we can spot a pattern. Also, posting the filebeat logs (preferably at debug level) would help us understand what's going on.
Thanks for your reply @tudor . The problem is that the issue is erratic and appears suddenly. After reading your reply I changed the log level from "warning" to "info" ("debug" seemed too verbose for production). With this I realised that the info logs are written to the file for a while, and after some time the file is empty and no logs are stored at all. That is why I was not able to prepare a log file for you.
On further investigation I found that we use qsrotate (http://mod-qos.sourceforge.net/qsrotate.1.html) alongside filebeat to rotate filebeat's own logs. This double rotation of logs (one by filebeat's keepfiles property and the other by qsrotate) was probably preventing filebeat from logging. I have now switched off qsrotate and we rely only on filebeat for its log rotation. We are still monitoring our production environment for the main issue (a few log files being skipped). While we monitor, a question for you: do you think qsrotate might have caused the main issue as well?
The output errors should not lead to data loss as filebeat is keeping the file open until it is completely sent.
In your log file above, I didn't see anything related to filebeat potentially missing a file. Could you update to filebeat 5.x and see if you still see the issue? The file handling is quite different in 5.x and fixed a lot of previous issues.
About the beats logging: that is also something we need to improve on our side, and we are working on it.
You said that "output errors should not lead to data loss as filebeat is keeping the file open until it is completely sent". But what I observed is that a complete file is missed, not a partial one. That would mean a harvester never opened that particular file for processing at all, which is a bit scary.
Any suggestions on what we could do with the current setup (i.e. the current version of filebeat)? It is not easy for us to suddenly upgrade in a production environment.
How fast are your logs rotating? That is, how long does it take for one log file to be rotated over at peak time? My assumption is that you sometimes hit a race condition. One option to make it less likely is to decrease scan_frequency and, for example, only have - "/path/to/file.log" in the pattern. We normally don't recommend this, because when you restart filebeat it can still pick up old files. But it reduces the risk of a race condition, as fewer files are tracked, and filebeat will still finish reading a file after it is renamed.
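A minimal sketch of what that could look like in a 1.x filebeat.yml (the path and the scan_frequency value are just examples, not recommendations):

```yaml
filebeat:
  prospectors:
    - paths:
        # Only the live file, not the rotated file.log.1, file.log.2 copies.
        # Filebeat keeps the handle open on rename, so a file that rotates
        # mid-read is still read to the end.
        - "/path/to/file.log"
      # Default is 10s; scanning more often narrows the window in which
      # a fast rotation can slip past the scanner.
      scan_frequency: 1s
```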
The biggest problem in 1.x is that there are no clean_* options, so the registry file grows over time, which makes it more likely that you hit an inode reuse issue. You could write a script to manually clean it up, but I don't recommend it.
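For illustration only, a sketch of what such a cleanup could look like, assuming the 1.x registry is a single JSON object keyed by file path (stop filebeat and back up the registry before touching it; function names here are made up for the example):

```python
import json
import os


def prune_registry(registry):
    """Return a copy of the registry containing only entries whose file
    path still exists on disk. Entries for rotated logs that have since
    been deleted are dropped, so the registry stops growing forever."""
    return {path: state for path, state in registry.items()
            if os.path.exists(path)}


def prune_registry_file(registry_path):
    """Load the JSON registry, drop stale entries, write it back, and
    return how many entries were removed."""
    with open(registry_path) as f:
        registry = json.load(f)
    pruned = prune_registry(registry)
    with open(registry_path, "w") as f:
        json.dump(pruned, f)
    return len(registry) - len(pruned)
```

Note this only removes entries for files that are gone; it does nothing about the inode-reuse problem itself, which is why upgrading is the better fix.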
So I still strongly recommend upgrading, even though I cannot be 100% sure the assumption above is correct (there is nothing like this in the log). The upgrade to 5.x should be rather simple, but yes, test it before going into production.
Thanks @ruflin. The upgrade did fix our issue. We finally found that we were running into https://github.com/elastic/beats/issues/1974 . Unfortunately, migrating from 1.3 to 5.3 also turned out to be a bit painful. We missed https://www.elastic.co/guide/en/beats/libbeat/5.0/breaking-changes-5.0.html where the config options were renamed from tls to ssl in the filebeat config. The issue is that filebeat -configtest also does not give an error for a tls configuration (I think this should be filed as an issue), so it took us some time to figure that out as well.
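For anyone else hitting this during a 1.x to 5.x migration, the rename looks roughly like this (illustrative logstash output snippet; hosts and certificate paths are placeholders, see the breaking-changes page linked above for the full list):

```yaml
# filebeat 1.x (old): the section is called tls
output:
  logstash:
    hosts: ["logstash:5044"]
    tls:
      certificate_authorities: ["/etc/pki/tls/certs/ca.crt"]

# filebeat 5.x (new): the section is called ssl
output.logstash:
  hosts: ["logstash:5044"]
  ssl.certificate_authorities: ["/etc/pki/tls/certs/ca.crt"]
```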
But finally everything works fine now. Thanks a lot for your help.
Glad it's working for you now. About -configtest: this one is quite tricky, as it mainly validates the yaml file and some of the config options. But it does not check whether there are "too many" config options ... We are aware of the issue and are trying to improve it.