That could be an inode reuse issue. There are links to various issues in the META issue 211. Especially see 251.
Tracking which files have been read when those files can be rotated is an extremely hard problem, much harder than most folks initially assume. One way to get it right is to checksum the file contents (although even that is not foolproof), but the file input does not do that because it can get ridiculously expensive. Instead it uses a very cheap technique that almost always gets it right, but in a few cases it decides it has already read a file that it has not read.
There are other cases where it gets it wrong by duplicating data. As I said, it is a really hard problem.
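To make that concrete, here is a rough sketch of what the file input's sincedb tracking looks like. The exact column layout differs between plugin versions and the values are made up for illustration; the point is that each file is keyed by its inode (plus device numbers) and the plugin remembers how many bytes it has read:

    # <inode> <major dev> <minor dev> <bytes read> <last activity> <path>
    262394 0 51713 1048576 1581341882.44 /weblogs/biweb.log

When logrotate removes the old file and a new biweb.log is created, the filesystem may hand the new file the same inode. The plugin then sees a known inode with a recorded read position, assumes it has already consumed that data, and skips it, which shows up as data loss around rotation time.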
We are using Logstash 1.5 in some legacy pipelines and, interestingly, we have not seen this data loss during log rotation in that pipeline.
What is the main difference between Logstash 1.5 and Logstash 7.6 in how the file input plugin handles log rotation?
We have tried the options below:
We tried a wildcard in path to handle the log rotation issue in LS 7.6, but it created a huge number of duplicates (it re-read all events from the last 24 hours), so we reverted:
path => "/weblogs/biweb.log*"
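For reference, this is roughly the input block we used for that test. The exclude, sincedb_path and ignore_older settings below are only a sketch of knobs we are considering to limit re-reading of old rotated files, not something we have validated, and the paths match our setup:

    input {
      file {
        path => "/weblogs/biweb.log*"
        # do not pick up compressed rotated files
        exclude => "*.gz"
        # keep the sincedb in a known, persistent location
        sincedb_path => "/var/lib/logstash/sincedb_biweb"
        # ignore files not modified in the last day (value in seconds)
        ignore_older => 86400
        # read newly discovered files from the start
        start_position => "beginning"
      }
    }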
We have tried tuning the pipeline parameters listed below, but saw no improvement in the data loss:
pipeline.workers from 2 to 12 (default: 24)
pipeline.batch.size from 125 to 250 (default: 125)
Currently sincedb_clean_after and sincedb_write_interval are not set, so the default values apply (sincedb_clean_after: 2 weeks, sincedb_write_interval: 15 seconds). Would tuning either of these properties help?
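If setting them explicitly is the recommendation, this is how we understand they would be configured. The values below are just the documented defaults, so presumably they change nothing until actually tuned:

    input {
      file {
        path => "/weblogs/biweb.log"
        # drop a file's sincedb entry this long after it was last seen
        sincedb_clean_after => "2 weeks"
        # flush the in-memory sincedb to disk every N seconds
        sincedb_write_interval => 15
      }
    }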