Most recent CSV file from S3 gets read indefinitely

I'm trying to process CSV files stored in an S3 bucket using Logstash. Everything works fine until it reaches the most recent file, which it keeps creating Elasticsearch entries for endlessly.

The data is in daily time buckets, and each CSV file contains data for one day (grouped by various things). If I watch the document count in Kibana's Discover section, a normal day contains no more than about 100,000 documents, but the count for the most recent day keeps climbing into the millions until I stop Logstash.

As a troubleshooting step, I've removed the filter block from my config, and I still see the document count on that index go way higher than it should. That confirms it isn't an issue with my filters, but possibly something in the way I've configured the s3 input.

Here is that reduced config with sensitive information redacted:

input {

  s3 {
    type => "b"
    endpoint => "<s3-compatible storage URL>"
    access_key_id => "<redacted>"
    secret_access_key => "<redacted>"
    bucket => "b"
    sincedb_path => "/var/lib/logstash/plugins/inputs/s3/b.sincedb"
  }

}

output {

  if [type] == "b" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "b"
    }
  }

}

I've tried with and without manually specifying the sincedb_path, and it does the same thing.

I don't see what would cause something like this unless there's something fundamental I'm misunderstanding about the S3 plugin. Any thoughts?

You have not set watch_for_new_files, and it defaults to true, so the s3 input sits in a loop, checking every object in the bucket to see whether its last_modified date is newer than the timestamp recorded in the sincedb. If it is, the file is added to the list of files to read.
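
For reference, setting the option explicitly looks like this (a sketch only, reusing the placeholder values from your config). With false the input lists the bucket once, reads what it finds, and then stops watching:

input {
  s3 {
    type => "b"
    endpoint => "<s3-compatible storage URL>"
    access_key_id => "<redacted>"
    secret_access_key => "<redacted>"
    bucket => "b"
    sincedb_path => "/var/lib/logstash/plugins/inputs/s3/b.sincedb"
    watch_for_new_files => false   # defaults to true; false means a one-shot import
  }
}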

The way I read the code, when it reads a file it reads the whole file. It does not store how far into the file it has read; it just re-reads the entire file, so if the last_modified date changes you get duplicate documents. There is an open issue for this, but it hasn't been updated in years.
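
One mitigation for the duplicates themselves (separate from stopping the re-reads) is to derive the Elasticsearch document ID from the event content, so a re-read overwrites the same document instead of creating a new one. A rough sketch using the fingerprint filter; the key value is just a placeholder:

filter {
  fingerprint {
    source => ["message"]
    method => "SHA256"
    key => "any-static-string"             # placeholder; makes the hash an HMAC
    target => "[@metadata][fingerprint]"
  }
}

output {
  if [type] == "b" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "b"
      document_id => "%{[@metadata][fingerprint]}"   # re-read lines overwrite instead of duplicating
    }
  }
}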

When a file is written to that S3 bucket, it's written in its entirety, and that happens once a day. In my situation I shouldn't need the plugin to keep track of how far into the file it has read.

I think I want watch_for_new_files to stay true, because I want it watching for the new file that arrives the next day.
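
For anyone who lands here with the same setup: the input also has options that keep the watch loop running while making re-reads less likely, since moving processed objects out of the watched bucket means the newest file can no longer be re-listed. This is only a sketch under that assumption; the backup bucket name is a placeholder and I haven't tested it:

input {
  s3 {
    type => "b"
    endpoint => "<s3-compatible storage URL>"
    access_key_id => "<redacted>"
    secret_access_key => "<redacted>"
    bucket => "b"
    sincedb_path => "/var/lib/logstash/plugins/inputs/s3/b.sincedb"
    watch_for_new_files => true        # keep polling for the next day's file
    interval => 300                    # poll every 5 minutes instead of the default 60 seconds
    backup_to_bucket => "b-processed"  # placeholder; processed objects are copied here
    delete => true                     # ...and removed from the watched bucket so they can't be re-read
  }
}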

I think my issue is more related to this bug: https://github.com/logstash-plugins/logstash-input-s3/issues/172

I made the code change mentioned in that issue, and it fixed my problem.

I don't know if I should mark this thread as solved, though, because patching the Logstash plugin code doesn't seem like a solution worth recommending to others.
