Hello there.
Frankly, I've given up on the S3 input plugin entirely. We're ingesting ELB access logs from S3, and since AWS delivers them roughly five minutes behind real time, there's no way to keep them current regardless. After hitting a number of serious issues with the S3 input plugin, I built a quasi-custom solution that drops the files into a plain local directory (an ordinary directory on an ext4 filesystem) once they're ready to be processed, which is what I'm doing now.
It doesn't seem to matter, though. There are always, ALWAYS files left behind after some period of running. At first I thought it was the number of workers I was using (-w 32), so after we'd caught up on several days of data I dropped it to just two workers to keep pace with the new data going forward.
Still, files were left behind. After some debugging, I discovered these files were listed in the sincedb but never processed. I use grok plus the metadata provided by the file input plugin to record the filename as a field, and sure enough, each missed file would appear in the sincedb, yet nothing was output to Elasticsearch or anywhere else (and no errors from logstash or Elasticsearch).
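For context, the filename capture looks roughly like this (a simplified sketch, not my exact config; the `source_filename` field name is just illustrative, and this assumes the file input's legacy, non-ECS `path` field):

```
filter {
  if [type] == "elb-logs" {
    mutate {
      # copy the file input's source path into a searchable field
      copy => { "path" => "source_filename" }
    }
  }
}
```

That field is how I could confirm a given file made it (or didn't make it) to Elasticsearch.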
Finally, I tried to make it as easy as possible for logstash to get it right, with the following config:
file {
  type => "elb-logs"
  sincedb_path => "/dev/null"
  file_completed_action => "log_and_delete"
  file_completed_log_path => "/home/admin/logstash/x-ingested.log"
  file_sort_by => "path"
  mode => "read"
  path => "/home/admin/logstash/incoming-logs/**/*.gz"
  start_position => "beginning"
}
Still... as I write this, logstash has again left 7 files behind. If I Ctrl+C and restart logstash with the exact same configuration, it suddenly "sees" the previously missed files. So it certainly isn't a glob/matching issue, and the sincedb is disabled. What on earth could this be?
Small update: just to reiterate, restarting logstash results in instant processing of the leftover files. This happens with nearly every few "batches" of files, which arrive at roughly 5-minute intervals. All files are named uniquely (and if they weren't, through some quirk on AWS's side, the duplicates would appear in the file_completed_log_path, which they do not).
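Since log_and_delete removes every successfully processed file, a quick way to flag the stuck ones is to list anything lingering past a batch interval. A minimal sketch (the 10-minute threshold is arbitrary, and it builds its own temp directory as a stand-in for incoming-logs/ just to demonstrate):

```shell
#!/bin/sh
# With file_completed_action => "log_and_delete", processed files are
# deleted, so any .gz still sitting in the incoming directory well past
# one ~5-minute batch interval is a file logstash skipped.
dir=$(mktemp -d)                                # stand-in for incoming-logs/
touch -d '20 minutes ago' "$dir/old-batch.gz"   # simulates a stuck file
touch "$dir/fresh-batch.gz"                     # just arrived, not stuck yet
# Anything older than 10 minutes is a candidate "left behind" file.
stuck=$(find "$dir" -name '*.gz' -mmin +10)
echo "$stuck"
rm -rf "$dir"
```

Running that out of cron is how I'm spotting the leftovers at the moment.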
I attempted to touch the files, to see if logstash would then pick them up. No go. I've just changed the max_open_files value to a very low 10, since only 20-30 files need processing at a time for the moment. If that helps, I'll update again... but this is becoming quite the pain.
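For anyone wondering where that knob lives, it's just another option on the same file input (sketch only; the elided options are unchanged from the config above):

```
file {
  mode => "read"
  path => "/home/admin/logstash/incoming-logs/**/*.gz"
  # cap on simultaneously open files; only 20-30 arrive per batch right now
  max_open_files => 10
  # ... remaining options as in the config above ...
}
```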