S3 input with cloudtrail codec not working with gzipped files

TL;DR: I can't get the S3 input with the cloudtrail codec working when the file is gzipped (which is the default for CloudTrail). It does work if I download the file, unzip it, and upload it to a different S3 bucket.

Details:

I am using Logstash 2.2.2.

I started out with a normal CloudTrail bucket created by AWS and a simple config like this:

input {
    s3 {
        bucket => "cloudtrail-logs"
        codec => cloudtrail {}
    }
}

output {
    stdout { codec => rubydebug }
}

When I run logstash with --debug, I see this:

S3 input: Adding to objects[] {:key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"116", :method=>"list_new_files"}
S3 input processing {:bucket=>"cloud-analytics-platform-cloudtrail-logs", :key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"150", :method=>"process_files"}
S3 input: Download remote file {:remote_key=>"AWSLogs/blahblah/CloudTrail/us-east-1/2016/03/15/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :local_filename=>"/var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"344", :method=>"download_remote_file"}
Processing file {:filename=>"/var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz", :level=>:debug, :file=>"logstash/inputs/s3.rb", :line=>"182", :method=>"process_local_log"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}
Pushing flush onto pipeline {:level=>:debug, :file=>"logstash/pipeline.rb", :line=>"450", :method=>"flush"}

It just keeps printing that last line over and over and never does anything else. If I look in /var/folders/8f/1bjm5vq53c73tjq0yl4560dj1r5f6h/T/logstash/, I do indeed see the gzipped file blahblah_CloudTrail_us-east-1_20160315T1520Z_qQm4gunsTnNuJosk.json.gz.

Now, if I unzip this file, create a test bucket, put the unzipped file into the test bucket, and run Logstash pointing at my test bucket, it works fine!

According to the docs at https://www.elastic.co/guide/en/logstash/current/plugins-inputs-s3.html , if the filename ends in .gz, the S3 input should handle it automatically.

What could be the problem? I am pulling my hair out here.

I'm also having problems pulling CloudTrail logs into Logstash 2.2.2 with an S3 input and the "logstash-codec-cloudtrail" codec. It does work, but incredibly slowly and with a very high CPU load.

I have three S3 inputs, each watching a separate "region" sub-folder in the S3 bucket for CloudTrail logs (roughly as sketched below). I have set a different ".sincedb" file for each S3 input so that I can keep an eye on progress by watching the timestamps in the files. During the day some of the 5-minute CloudTrail GZ files in S3 can each contain between 3000 and 4000 event records, and the GZ file is around 250 kB. From a thread dump, it seems that most of the CPU time is spent in the Ruby gzip handler. A 5-minute CloudTrail GZ log file from S3 can take between 10 and 20 minutes to get through the Logstash pipeline and into Elasticsearch. The delay is not the Elasticsearch output; all the CPU is consumed by the "[base]S3" thread shown in "top".
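For context, each of the three inputs looks roughly like this (the bucket name, account ID, prefix, and sincedb path below are placeholders, not my real values):

input {
    # one of three inputs; the other two differ only in prefix and sincedb_path
    s3 {
        bucket => "my-cloudtrail-logs"
        prefix => "AWSLogs/123456789012/CloudTrail/us-east-1/"
        sincedb_path => "/var/lib/logstash/sincedb-cloudtrail-us-east-1"
        codec => cloudtrail {}
    }
}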

My workaround has been to set up a batch job that fetches new CloudTrail GZ files, runs them through "jq" to break them out into individual events, i.e. splitting them out of the parent "Records" array that CloudTrail uses by default, and then copies the unzipped JSON files to another S3 location. Using Logstash with an S3 input, JSON codec and Elasticsearch output rips through these uncompressed JSON files in no time. I added a date filter to match against the "eventTime" field. The slow component seems to be the S3 input when used in conjunction with the "cloudtrail" codec.
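Roughly, the second pipeline looks like this (the bucket name, sincedb path and Elasticsearch host are placeholders, and it assumes the batch job writes one JSON event per line):

input {
    s3 {
        bucket => "my-cloudtrail-json"
        sincedb_path => "/var/lib/logstash/sincedb-cloudtrail-json"
        codec => json   # one CloudTrail record per line, already split out of "Records"
    }
}

filter {
    # CloudTrail's eventTime is ISO8601, e.g. 2016-03-15T15:20:00Z
    date {
        match => ["eventTime", "ISO8601"]
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
    }
}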

Clive,

Thanks for confirming that I'm not crazy.

I went with a different solution; I decided to run traildash. It basically replaces logstash + s3-input + cloudtrail-codec + elasticsearch-output.

However, I hope this gets fixed some day, since I'd rather just use plain old Logstash + plugins, where I have ultimate control.