GZIP invalid header with Filebeat S3 input and GuardDuty logs

I'm trying to ingest AWS GuardDuty findings with Filebeat, but the s3 input logs ERROR messages for every object it processes:

2020-05-20T22:42:27.973Z	ERROR	[s3]	s3/input.go:447	gzip.NewReader failed: gzip: invalid header
2020-05-20T22:42:27.974Z	ERROR	[s3]	s3/input.go:386	createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/15/659b5608-a71c-3b42-8979-f851e61d9098.jsonl.gz: gzip.NewReader failed: gzip: invalid header
2020-05-20T22:42:27.974Z	WARN	[s3]	s3/input.go:277	Processing message failed, updating visibility timeout
2020-05-20T22:42:28.005Z	ERROR	[s3]	s3/input.go:447	gzip.NewReader failed: gzip: invalid header
2020-05-20T22:42:28.005Z	ERROR	[s3]	s3/input.go:386	createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/14/1557fa6e-f5f8-36ea-add9-28070f1ff7ee.jsonl.gz: gzip.NewReader failed: gzip: invalid header

I have configured GuardDuty to export findings to an S3 bucket. The files contain newline-delimited JSON (JSON Lines) and are gzip-compressed. The metadata on the objects is as follows:

Content-Encoding: gzip
Content-Type: application/json

Filebeat is configured with the s3 input plugin, version 7.6.2. I found the line of code in Filebeat that generates this error, but I can't figure out how to work around it.

If I download the file using aws s3 cp, I can see that the file really is gzip-compressed, and it decompresses just fine on my local machine.

Could it be that the AWS SDK for Go is automatically decompressing the file? I found aws-sdk-go issue #1292, which says the default HTTP transport transparently decompresses the object body unless gzip is explicitly specified as an accepted encoding.

Filebeat does not specify gzip as an accepted encoding when it calls GetObjectRequest(), so it may be trying to decompress data that has already been decompressed.
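That theory would explain the exact error text: Go's gzip.NewReader rejects a body that is no longer gzip-compressed. A minimal sketch of what the s3 input presumably does with the object body (the helper names here are mine, not Filebeat's):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipCompress returns data compressed with gzip.
func gzipCompress(data []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Bytes()
}

// tryGunzip mimics the s3 input: wrap the downloaded body in gzip.NewReader
// because the object key ends in .gz.
func tryGunzip(body []byte) (string, error) {
	zr, err := gzip.NewReader(bytes.NewReader(body))
	if err != nil {
		return "", fmt.Errorf("gzip.NewReader failed: %w", err)
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	plain := []byte(`{"schemaVersion":"2.0"}` + "\n")

	// An already-decompressed body (what the SDK would hand back after
	// transparent decoding) is rejected with "gzip: invalid header".
	if _, err := tryGunzip(plain); err != nil {
		fmt.Println(err)
	}

	// A genuinely gzip-compressed body decompresses fine.
	if out, err := tryGunzip(gzipCompress(plain)); err == nil {
		fmt.Print(out)
	}
}
```

Running this prints "gzip.NewReader failed: gzip: invalid header" for the plain body, which matches the messages in my Filebeat log.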

You can try deleting the "Content-Encoding" entry from the object metadata. The data that Filebeat downloads is not valid gzip content.

@mtojek I'm not sure I understand your suggestion.

The object metadata is populated by AWS GuardDuty when it writes objects into the bucket, and Filebeat acts on each new object as soon as it is put into the bucket, because it subscribes to an SQS queue of bucket notification events.

I could overwrite the metadata on one object by hand, but that doesn't help the ingest flow: I can't change the Content-Encoding that GuardDuty sets on new objects.

@mtojek I tested your theory by copying one of the files produced by GuardDuty into the same bucket with different metadata. That allowed Filebeat to ingest the file normally, and the events are visible in Kibana.

aws --profile root s3 cp --metadata Content-Encoding=faketest s3://my-guardduty-bucket/AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/14/017c8c8f-7ad6-3da5-9c53-8c08ba35b370.jsonl.gz s3://my-guardduty-bucket/AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/14/017c8c8f-7ad6-3da5-9c53-8c08ba35b370-3.jsonl.gz

I did not change the content of the file. So the data in S3 is valid gzip content, but Filebeat fails to process it whenever the object's metadata says Content-Encoding: gzip.
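This is consistent with the transparent-decompression behavior described in aws-sdk-go issue #1292. The effect is reproducible with a small, self-contained Go sketch, using httptest as a stand-in for S3 (the server and helper names are mine, for illustration only):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newGzipServer mimics S3 serving a GuardDuty object: a gzip-compressed
// body accompanied by a Content-Encoding: gzip header.
func newGzipServer(payload []byte) *httptest.Server {
	var gz bytes.Buffer
	zw := gzip.NewWriter(&gz)
	zw.Write(payload)
	zw.Close()
	body := gz.Bytes()
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Encoding", "gzip")
		w.Header().Set("Content-Type", "application/json")
		w.Write(body)
	}))
}

// fetchBody gets the URL, optionally setting Accept-Encoding: gzip
// explicitly on the request.
func fetchBody(url string, explicitGzip bool) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	if explicitGzip {
		// An explicit Accept-Encoding disables the transport's
		// transparent gzip decoding.
		req.Header.Set("Accept-Encoding", "gzip")
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	srv := newGzipServer([]byte(`{"finding":"example"}` + "\n"))
	defer srv.Close()

	// Default request: the transport adds Accept-Encoding: gzip itself and
	// transparently decompresses, so the caller sees plain JSON. Handing
	// this to gzip.NewReader gives "gzip: invalid header".
	plain, _ := fetchBody(srv.URL, false)
	fmt.Printf("default:  %q\n", plain)

	// Explicit Accept-Encoding: gzip: the raw gzip bytes (magic 1f 8b)
	// arrive untouched, ready for gzip.NewReader.
	raw, _ := fetchBody(srv.URL, true)
	fmt.Printf("explicit: first bytes %x\n", raw[:2])
}
```

If this is what's happening inside the SDK's HTTP client, the fix on Filebeat's side would be to stop the transport from stripping the gzip layer before the input tries to decompress it.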

I'm still not sure how to make Filebeat ingest GuardDuty logs. Should I run a Lambda that rewrites the metadata on every object before the notification reaches Filebeat's queue? That doesn't seem like a good solution.

I'm fairly sure this is a bug in the Filebeat s3 input, so I reported it on the GitHub issue tracker.