Logstash S3 Input - How does it know where to start?

If I define an S3 bucket as input, and Logstash processes a few files before being turned off, how does it know where to resume so that it doesn't process the same data again? For example, if I have 2 files in my S3 bucket, and Logstash has already pushed them to my specified output (Elasticsearch), it won't re-process them even if I delete the data from the output (delete the Elasticsearch index).

From what I see, the sincedb file stores a date (it keeps track of the date the last handled file was added to S3). Does Logstash use this date to figure out which files have been processed? What if N files were pushed into S3 at the same time and I stopped Logstash when it had finished processing N-1 of them: would it re-process all N files when I start Logstash again?

Thank you

From looking at the source, I see that the plugin updates the sincedb with a file's last-modified timestamp when it is done reading that file.

When listing the bucket's files into local memory, it compares each file's last-modified timestamp against the sincedb value to determine which files are new.

In theory, this means that if N > 1 files are uploaded to S3 with the same timestamp, and Logstash quits or exits after reading the first one but before consuming the rest, the remaining files with that timestamp will be skipped when Logstash starts back up.
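To make the failure mode concrete, here is a rough Python model of that logic. It is only a sketch, not the plugin's code (the plugin is written in Ruby): the names `list_new` and `run`, the file names, and the in-memory "bucket" are all made up for illustration, and it assumes a strict "newer than" comparison against the stored timestamp.

```python
from datetime import datetime, timezone


def list_new(objects, sincedb_time):
    """Return keys whose last-modified time is strictly newer than sincedb_time.

    `objects` maps a (hypothetical) S3 key to its last-modified timestamp.
    """
    ordered = sorted(objects.items(), key=lambda item: item[1])
    return [key for key, modified in ordered if modified > sincedb_time]


def run(objects, sincedb_time, stop_after=None):
    """Simulate one Logstash run.

    The sincedb value is advanced after each file is read. `stop_after`
    mimics the process being stopped after that many files.
    """
    processed = []
    for key in list_new(objects, sincedb_time):
        if stop_after is not None and len(processed) >= stop_after:
            break  # simulated shutdown before the remaining files are read
        processed.append(key)
        sincedb_time = objects[key]  # record this file's last-modified time
    return processed, sincedb_time


# Three files uploaded with an identical timestamp (illustrative values).
same_time = datetime(2015, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
bucket = {"a.log": same_time, "b.log": same_time, "c.log": same_time}
start = datetime.fromtimestamp(0, tz=timezone.utc)  # empty sincedb

first_run, sincedb = run(bucket, start, stop_after=1)
second_run, sincedb = run(bucket, sincedb)

print(first_run)   # only the first file was read before the "shutdown"
print(second_run)  # nothing is newer than the stored timestamp any more
```

In this model the second run picks up nothing, because `b.log` and `c.log` are not strictly newer than the timestamp recorded for `a.log`.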

Which explains this bug filed back in October 2015 :weary:
