Most recent CSV file from S3 gets read indefinitely

I'm trying to process CSV files stored in an S3 bucket using Logstash. Everything works fine until it reaches the most recent file, which it keeps creating Elasticsearch entries for endlessly.

The data is in daily time buckets, and each CSV file contains data for one day (grouped by various things). If I watch the document count in Kibana's Discover section, a normal day contains no more than about 100,000 documents, but the count for the most recent day keeps climbing into the millions until I stop Logstash.

As a troubleshooting step, I've removed the filter block from my config, and I still see the document count on that index go way higher than it should. That confirms it isn't an issue with my filters, but possibly something in the way I've configured the s3 input.

Here is that reduced config with sensitive information redacted:

input {

  s3 {
    type => "b"
    endpoint => "<s3-compatible storage URL>"
    access_key_id => "<redacted>"
    secret_access_key => "<redacted>"
    bucket => "b"
    sincedb_path => "/var/lib/logstash/plugins/inputs/s3/b.sincedb"
  }

}

output {

  if [type] == "b" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "b"
    }
  }

}

I've tried with and without manually specifying the sincedb_path, and it does the same thing.

I don't see what would cause something like this unless there's something fundamental I'm misunderstanding about the S3 plugin. Any thoughts?

You have not set watch_for_new_files, and it defaults to true, so the s3 input sits in a loop, checking every object in the bucket to see whether its last_modified date is newer than the timestamp recorded in the sincedb. If it is, the file is added to the list of files to read.
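
For reference, setting the option explicitly looks like this (a sketch only, reusing the placeholder values from your config). With false the input lists the bucket once, reads what it finds, and then stops watching:

input {
  s3 {
    type => "b"
    endpoint => "<s3-compatible storage URL>"
    access_key_id => "<redacted>"
    secret_access_key => "<redacted>"
    bucket => "b"
    sincedb_path => "/var/lib/logstash/plugins/inputs/s3/b.sincedb"
    watch_for_new_files => false   # defaults to true; false means a one-shot import
  }
}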

The way I read the code, when it reads a file it reads the whole file. It does not store how far into the file it has read; it just re-reads the entire file, so if the last_modified date changes you get duplicate documents. There is an open issue for this, but it hasn't been updated in years.
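
One mitigation for the duplicates themselves (separate from stopping the re-reads) is to derive the Elasticsearch document ID from the event content, so a re-read overwrites the same document instead of creating a new one. A rough sketch using the fingerprint filter; the key value is just a placeholder:

filter {
  fingerprint {
    source => ["message"]
    method => "SHA256"
    key => "any-static-string"             # placeholder; makes the hash an HMAC
    target => "[@metadata][fingerprint]"
  }
}

output {
  if [type] == "b" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "b"
      document_id => "%{[@metadata][fingerprint]}"   # re-read lines overwrite instead of duplicating
    }
  }
}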

When a file is written to that S3 bucket, it's written in its entirety, and that happens once a day. In my situation I shouldn't need the plugin to keep track of how far into the file it has read.

I think I want watch_for_new_files to stay true, because I want it watching for the new file that arrives the next day.
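
For anyone who lands here with the same setup: the input also has options that keep the watch loop running while making re-reads less likely, since moving processed objects out of the watched bucket means the newest file can no longer be re-listed. This is only a sketch under that assumption; the backup bucket name is a placeholder and I haven't tested it:

input {
  s3 {
    type => "b"
    endpoint => "<s3-compatible storage URL>"
    access_key_id => "<redacted>"
    secret_access_key => "<redacted>"
    bucket => "b"
    sincedb_path => "/var/lib/logstash/plugins/inputs/s3/b.sincedb"
    watch_for_new_files => true        # keep polling for the next day's file
    interval => 300                    # poll every 5 minutes instead of the default 60 seconds
    backup_to_bucket => "b-processed"  # placeholder; processed objects are copied here
    delete => true                     # ...and removed from the watched bucket so they can't be re-read
  }
}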

I think my issue is more related to this bug: https://github.com/logstash-plugins/logstash-input-s3/issues/172

I made the code change mentioned in that issue, and it fixed my problem.

I don't know if I should mark this thread as solved, though, because patching the Logstash plugin code doesn't seem like a solution worth recommending to others.
