How to handle gzip'd gzip'd (twice) files in S3

markwan · January 23, 2019, 3:38pm

A client is providing AWS CloudTrail logs via an S3 bucket. The S3 input plugin will read .gz files.

However, the problem is that the file, e.g. "cloudtrail_log_2019-01-22.gz", is actually gzip'd twice and should really be named "cloudtrail_log_2019-01-22.gz.gz".

I have tried the gzip_lines codec, but that assumes each line in a file is gzip'd, not the whole file.

Has anyone else seen this problem and found a solution? Right now I'm thinking about a pipeline using the s3 input which just unzips the outer layer and writes the inner gzip file to the local drive with a .gz suffix, followed by a second pipeline which reads that new .gz file using the file input.

Thoughts? Comments?
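
As an alternative to two Logstash pipelines, the unwrapping could be done in a small preprocessing step before Logstash ever sees the data. The following is only a minimal sketch, assuming the object has already been downloaded into memory as bytes; `peel_gzip_layers` and `max_layers` are hypothetical names for illustration and are not part of Logstash or the s3 input plugin. It keeps decompressing for as long as the payload still starts with the gzip magic bytes (`1f 8b`).

```python
import gzip

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def peel_gzip_layers(data: bytes, max_layers: int = 5) -> bytes:
    """Decompress repeatedly while the payload still looks like gzip.

    max_layers is a hypothetical safety limit so that a malicious or
    corrupt file cannot make this loop forever.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC:
        if layers >= max_layers:
            raise ValueError("too many nested gzip layers")
        data = gzip.decompress(data)
        layers += 1
    return data


if __name__ == "__main__":
    # Simulate a file that was gzip'd twice, as described above.
    original = b'{"Records": []}\n'
    gzipped_twice = gzip.compress(gzip.compress(original))
    assert peel_gzip_layers(gzipped_twice) == original
```

The check relies on the inner payload being text such as JSON, which never starts with the gzip magic bytes; binary payloads would need a different stopping rule.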

markwan · January 31, 2019, 9:55am

The client has changed the data. The files in S3 are no longer double gzip'd. However, the files do not have a .gz extension, which means the s3 input does not automatically recognise the data as gzip'd.

The gzip_lines codec appears to expect each line to be gzip data, not a whole file that is gzip'd, so the data cannot be parsed using that codec.

Is the only solution to extend the s3 input plugin with an optional is_gzip flag that forces gzip behaviour? Or is there a simpler solution available?
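
To illustrate what content-based detection might look like outside the plugin, here is a rough sketch that decides whether a local file is gzip'd from its first two bytes rather than from its name. It assumes the file has already been copied out of S3 to local disk and is UTF-8 text once decompressed; `is_gzip` and `read_lines` are hypothetical helper names and do not correspond to any Logstash option.

```python
import gzip
from pathlib import Path
from typing import Iterator

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def read_lines(path: Path) -> Iterator[str]:
    """Yield text lines, decompressing when the content is gzip.

    The file extension is ignored entirely; only the content decides.
    """
    opener = gzip.open if is_gzip(path) else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

A script along these lines could either feed the decompressed lines onward or simply rewrite the file with a .gz suffix so that existing gzip handling picks it up.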

