Logstash only uploading SOME files to s3

So we have a setup that works pretty well: data comes in through the beats input plugin and gets shipped to S3.

I have it set up so that the output writes the local files to different directories, so each class of data stays within its size and time limits and is kept separate. The files are then uploaded from there to various S3 buckets for processing (we partition it out so that over the weekend our batch jobs don't overrun all of the other data and can queue up as necessary). Call it logical load balancing, if you will...
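
For context, the routing follows the pattern sketched below. The second branch is a hypothetical stand-in (its field value, bucket name, and paths are invented for illustration, and credentials/size/time settings are omitted); only the ACC branch of the real config is pasted further down:

output {
  # Each data class gets its own bucket and its own temporary_directory
  if "ACC" in [opp] {
    s3 {
      bucket => "acc-bucket"
      prefix => "1-node-acc/"
      temporary_directory => "/data/logstash/forwarder-acc"
    }
  } else if "BATCH" in [opp] {   # hypothetical second partition
    s3 {
      bucket => "batch-bucket"
      prefix => "1-node-batch/"
      temporary_directory => "/data/logstash/forwarder-batch"
    }
  }
}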

The issue is that a few of the nodes just suddenly stop sending SOME of the files. We restart Logstash and everything gets uploaded, no sweat. But about 10-15 minutes later, uploads are stuck again. We ended up cron-scheduling a service logstash restart (once an hour) to get the files pushed in a timely manner.
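
For reference, the workaround is nothing fancier than an hourly root crontab entry along these lines (the exact service invocation will vary by distro):

0 * * * * service logstash restart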

Any thoughts on this? The Logstash config file for the nodes uploading the data to S3 is below.

Errors seen on the Logstash node include:
{:timestamp=>"2016-05-25T14:00:25.038000-0700", :message=>"S3: have found temporary file the upload process crashed, uploading file to S3.", :filename=>"ls.s3.phx7b02c-543d.stratus.phx.ebay.com.2016-05-25T13.18.part18.txt", :level=>:warn}
{:timestamp=>"2016-05-25T14:00:25.038000-0700", :message=>"S3: have found temporary file the upload process crashed, uploading file to S3.", :filename=>"ls.s3.phx7b02c-543d.stratus.phx.ebay.com.2016-05-25T13.41.part41.txt", :level=>:warn}
{:timestamp=>"2016-05-25T14:02:20.665000-0700", :message=>"S3: Cannot delete the temporary file since it doesn't exist on disk", :filename=>"ls.s3.phx7b02c-543d.stratus.phx.ebay.com.2016-05-25T14.01.part1.txt", :level=>:warn}
{:timestamp=>"2016-05-25T14:02:24.304000-0700", :message=>"S3: AWS error", :error=>#<AWS::S3::Errors::BadRequest: An error occurred when parsing the HTTP request.>, :level=>:error}
{:timestamp=>"2016-05-25T14:02:25.101000-0700", :message=>"S3: AWS error", :error=>#<AWS::S3::Errors::BadRequest: An error occurred when parsing the HTTP request.>, :level=>:error}
{:timestamp=>"2016-05-25T14:04:20.052000-0700", :message=>"S3: Cannot delete the temporary file since it doesn't exist on disk", :filename=>"ls.s3.phx7b02c-543d.stratus.phx.ebay.com.2016-05-25T14.03.part3.txt", :level=>:warn}
{:timestamp=>"2016-05-25T14:05:20.074000-0700", :message=>"S3: Cannot delete the temporary file since it doesn't exist on disk", :filename=>"ls.s3.phx7b02c-543d.stratus.phx.ebay.com.2016-05-25T14.04.part4.txt", :level=>:warn}
{:timestamp=>"2016-05-25T14:05:21.935000-0700", :message=>"S3: AWS error", :error=>#<AWS::S3::Errors::BadRequest: An error occurred when parsing the HTTP request.>, :level=>:error}

We have Logstash servers in AWS that read these files, process them, and send them on to ES as well as to another S3 archive bucket. No errors noted on them.

input {
  beats {
    type => "beats"
    port => 9990
    codec => "json"
  }
  beats {
    type => "beats"
    port => 9991
    codec => "json"
  }
} # End input
output {

  if "ACC" in [opp] {

    s3 {
      access_key_id => "redact"
      secret_access_key => "redact"
      region => "us-west-2"
      bucket => "redact"
      canned_acl => "authenticated_read"
      size_file => 50000000
      time_file => 1
      upload_workers_count => 20
      prefix => "1-phx7b02c-543d-acc/"
      codec => "json_lines"
      temporary_directory => "/data/logstash/forwarder-acc"
      restore => true
    }
  } # End acc if

} # End output

As an aside, I forgot to mention:

  1. Monster hardware (24 cores, 72 GB RAM)
  2. Running Logstash 2.3.1
  3. Network is IP -> NAT IP -> AWS (but it works). I am wondering if a network disconnect is causing the upload workers to puke or hang, so that only once Logstash is restarted can they actually upload the files again.
  4. It works swimmingly on 2 other machines, with very few files missed (but there are a few).
  5. The backlog can at times add up to several GB in an hour (possibly more, depending on that day's load), but Logstash seems to handle it well at startup, re-uploading those files while still receiving new ones, without issue.
  6. There are around 150 Filebeat nodes streaming data to the machine at any one time.

Thanks!