Filebeat s3 input cannot parse a jsonl file whose content-type is set to application/json

Hello! I was hoping y'all could help me out.

The essence of my problem is that the filebeat S3 input plugin cannot process an s3 object whose content-type is application/json AND whose content is a separate json object per line (i.e. jsonl). Processing such an object was possible until v7.7.0, when the S3 input plugin started enforcing json parsing if it saw a content-type of application/json: https://github.com/elastic/beats/blob/5e69e25b920e3d93bec76a09a31da3ab35a55607/x-pack/filebeat/input/s3/input.go#L432
Before that, json processing was ONLY controlled by the expand_event_list_from_field configuration.

Now, it's probably the case that the content-type on the s3 object should NOT be application/json in the first place, but I do not have control over that :frowning:. I'm essentially dealing with the same problem as https://github.com/elastic/beats/issues/18696, except that Cloudflare is the entity pushing the logs to s3 (in that issue it is AWS GuardDuty), and I don't have control over how Cloudflare sets content-type.

I reproduced this on the latest version of filebeat (I compiled locally):
filebeat version 8.0.0 (amd64), libbeat 8.0.0 [3341c1bca5626d1ee90af692617f10f58695ed1c built 2020-06-30 20:15:28 +0000 UTC]. Here are the full steps with some info (like AWS account numbers) omitted:

$ cat s3filebeat.log 
{"id": "0001", "hey": "there", "how": {"are": "you"}}
{"id": "0002", "hope": "you", "are": {"doing": "well"}}
{"id": "0003", "I": "am", "doing": {"O": "K"}}

$ gzip s3filebeat.log 

$ aws --profile PROFILE s3api put-object --body ./s3filebeat.log.gz --bucket lucas-test-filebeat-s3 --content-encoding gzip --content-type application/json --key s3filebeat.log.gz
{
    "ETag": "\"955ed9f01b6ee38dbba167daab9ebbbb\""
}

$ cat filebeat.yml 
filebeat.inputs:
- type: s3
  queue_url: "https://sqs.us-east-1.amazonaws.com/ACCTNUM/cloudflare_logs_dev"
  role_arn: "arn:aws:iam::ACCTNUM:role/cloudflare_logs_filebeat_access_s3_sqs_dev"

output.console:
  pretty: true

$ ./filebeat -e
2020-06-30T15:39:04.180-0500	INFO	instance/beat.go:628	Home path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Config path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Data path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data] Logs path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs]
2020-06-30T15:39:04.180-0500	INFO	instance/beat.go:636	Beat ID: 2ef98d6a-ef7c-4885-a258-b4341a63b43b
2020-06-30T15:39:04.181-0500	INFO	[seccomp]	seccomp/seccomp.go:124	Syscall filter successfully installed
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:964	Beat info	{"system_info": {"beat": {"path": {"config": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "data": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data", "home": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "logs": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs"}, "type": "filebeat", "uuid": "2ef98d6a-ef7c-4885-a258-b4341a63b43b"}}}
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:973	Build info	{"system_info": {"build": {"commit": "3341c1bca5626d1ee90af692617f10f58695ed1c", "libbeat": "8.0.0", "time": "2020-06-30T20:15:28.000Z", "version": "8.0.0"}}}
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:976	Go runtime info	{"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":12,"version":"go1.13.3"}}}
2020-06-30T15:39:04.182-0500	INFO	[beat]	instance/beat.go:980	Host info	{"system_info": {"host": {"architecture":"x86_64","boot_time":"2020-06-22T10:53:41-05:00","containerized":false,"name":"lgroenendaal-XPS-15-9570","ip":["127.0.0.1/8","::1/128","192.168.86.24/24","fe80::c38c:a654:ca3e:d0dd/64","10.20.211.109/32","fe80::d2ee:dd77:2943:aa45/64","172.17.0.1/16"],"kernel_version":"5.3.0-59-generic","mac":["9c:b6:d0:c6:01:39","02:42:e2:23:9b:84"],"os":{"family":"debian","platform":"ubuntu","name":"Ubuntu","version":"18.04.2 LTS (Bionic Beaver)","major":18,"minor":4,"patch":2,"codename":"bionic"},"timezone":"CDT","timezone_offset_sec":-18000,"id":"5a8843fa712d481595ebd41926cda45f"}}}
2020-06-30T15:39:04.183-0500	INFO	[beat]	instance/beat.go:1009	Process info	{"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":null,"effective":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "exe": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/filebeat", "name": "filebeat", "pid": 5578, "ppid": 4156, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2020-06-30T15:39:03.640-0500"}}}
2020-06-30T15:39:04.183-0500	INFO	instance/beat.go:298	Setup Beat: filebeat; Version: 8.0.0
2020-06-30T15:39:04.183-0500	INFO	[publisher]	pipeline/module.go:113	Beat name: lgroenendaal-XPS-15-9570
2020-06-30T15:39:04.184-0500	WARN	beater/filebeat.go:151	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500	INFO	[monitoring]	log/log.go:118	Starting metrics logging every 30s
2020-06-30T15:39:04.184-0500	INFO	instance/beat.go:449	filebeat start running.
2020-06-30T15:39:04.184-0500	WARN	beater/filebeat.go:251	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500	INFO	registrar/registrar.go:145	Loading registrar data from /home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data/registry/filebeat/data.json
2020-06-30T15:39:04.184-0500	INFO	registrar/registrar.go:152	States Loaded from registrar: 0
2020-06-30T15:39:04.184-0500	INFO	[crawler]	beater/crawler.go:71	Loading Inputs: 1
2020-06-30T15:39:04.185-0500	WARN	[cfgwarn]	s3/input.go:131	BETA: s3 input type is used
2020-06-30T15:39:04.185-0500	INFO	[crawler]	beater/crawler.go:141	Starting input (ID: 18222034013403473169)
2020-06-30T15:39:04.185-0500	INFO	[crawler]	beater/crawler.go:108	Loading and starting Inputs completed. Enabled inputs: 1
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:173	visibility timeout is set to 300 seconds
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:174	aws api timeout is set to 2m0s
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:196	s3 input worker has started. with queueURL: https://sqs.us-east-1.amazonaws.com/374144443638/cloudflare_logs_dev
2020-06-30T15:39:14.633-0500	ERROR	[s3]	s3/input.go:458	expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-06-30T15:39:14.633-0500	ERROR	[s3]	s3/input.go:393	createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file

The important error is the last line: s3/input.go:393 createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file.

If I try, just for fun, to include the expand_event_list_from_field config it will, understandably, fail to parse and we'll get the WARN log: s3/input.go:542 decode json failed for 's3filebeat.log.gz' from S3 bucket 'lucas-test-filebeat-s3', skipping this file: json: cannot unmarshal string into Go value of type []interface {}.
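To illustrate the general mismatch (this is only a sketch in Python, not the Filebeat code, and the function names are made up for the example): a jsonl body is not one JSON document, so anything that tries to decode the whole object as a single JSON value fails, while decoding one line at a time works. It uses the three sample lines from s3filebeat.log above and round-trips them through gzip the same way the uploaded object was compressed.

```python
import gzip
import json

# The three sample lines from s3filebeat.log in the reproduction steps.
JSONL = (
    '{"id": "0001", "hey": "there", "how": {"are": "you"}}\n'
    '{"id": "0002", "hope": "you", "are": {"doing": "well"}}\n'
    '{"id": "0003", "I": "am", "doing": {"O": "K"}}\n'
)


def parse_as_single_document(text):
    """Hypothetical helper: decode the whole body as one JSON value."""
    return json.loads(text)


def parse_as_jsonl(text):
    """Hypothetical helper: decode each non-empty line as its own JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Compress and decompress to mimic reading the gzip-encoded S3 object.
body = gzip.decompress(gzip.compress(JSONL.encode("utf-8"))).decode("utf-8")

try:
    parse_as_single_document(body)
    single_document_ok = True
except json.JSONDecodeError:
    # The second object on the next line counts as extra data after the first value.
    single_document_ok = False

events = parse_as_jsonl(body)
print(single_document_ok, len(events), events[0]["id"])
```

The exact error text differs between Python and Go, but the underlying point is the same: the content-type header promises one JSON document and the body delivers three.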

For the time being I'll probably use an older version of this plugin (unless you don't think this behavior will EVER be supported again in which case I'll have to do something else). Also, I'd be happy to create a GH issue if it helps!

Phew, that was long. Thanks to those who stuck with me,
Lucas

Personally I think the appropriate change would be to ignore content-type: application/json and have expand_event_list_from_field be the ONLY thing controlling whether or not to parse the object content as JSON (which is how the logic used to be). I defer to you maintainers though because you have the vision for how you want the code to behave.
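To make the difference concrete, here is a rough model of the two behaviors as I understand them from the description and the error message above. It is a hypothetical Python sketch, not the actual Go code in input.go; the function name, the `enforce_content_type` flag, and the "json"/"lines" return values are all invented for illustration.

```python
def choose_parsing(content_type, expand_event_list_from_field, enforce_content_type):
    """Hypothetical model of how the object content gets handled.

    enforce_content_type=True  models the behavior since v7.7.0.
    enforce_content_type=False models the earlier behavior (and the proposal).
    Returns "json" for JSON decoding or "lines" for line-by-line handling.
    """
    if expand_event_list_from_field:
        # The config option asks for JSON decoding in both variants.
        return "json"
    if enforce_content_type and content_type == "application/json":
        # Mirrors the error seen in the log output above.
        raise ValueError(
            "expand_event_list_from_field parameter is missing in config "
            "for application/json content-type file"
        )
    return "lines"


# Earlier behavior / proposal: content-type is ignored, jsonl is read line by line.
old_result = choose_parsing("application/json", None, enforce_content_type=False)

# Behavior since v7.7.0: the same object is rejected.
try:
    choose_parsing("application/json", None, enforce_content_type=True)
    new_result = "accepted"
except ValueError:
    new_result = "rejected"

print(old_result, new_result)
```

In this model the proposal amounts to dropping the `enforce_content_type` branch so that only the config option decides.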

I created a GH issue for this, and it is getting traction over there: https://github.com/elastic/beats/issues/19902

This has been resolved! :tada: https://github.com/elastic/beats/pull/19962

Thanks to the wonderful @Kaiyan_Sheng for fixing it.
