A client is providing AWS cloud trail logs via an S3 bucket. The S3 input plugin will read .gz files.
However the problem is that the file, eg "cloudtrail_log_2019-01-22.gz" is actually gzip'd twice, and should really be named cloudtrail_log_2019-01-22.gz.gz"
I have tried the gzip_lines codec but that assumes each line in a file is gzip'd not the whole file.
Has anyone else seen this problem and found a solution? Right now I'm thinking about a pipeline using s3 input which just unzips and outputs to a file on the local drive, but suffixing with a .gz on the inner gzip file. Then a second pipeline which reads from that using file codec on the new .gz file.
Client has changed data. Now the files in S3 are not double gzip'd. However the file does not have a .gz extension. This means s3 input does not automatically recognise data as gzip'd.
Using gzip_lines codec looks like it's looking for each line to be gzip data, not a whole file that is gzip'd. So data cannot be parsed using that codec.
Is the only solution to extend the s3 input plugin to allow an optional is_gzip flag that forces gzip behaviour? or is there a simpler solution available?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.