How to get JSON data from AWS s3 bucket

Hello,

ELK 7.7.0
Logstash 7.7.0

I've been trying to do the below:

I have a setup in which the source data is placed into an S3 bucket. Due to the bucket config the files are saved as .txt. However, the data inside these files is actually in JSON format. I have used https://jsonformatter.org/ to validate the JSON and it is all correctly formatted.
Using the s3 input in Logstash to ingest this data is proving a struggle. I have tried a couple of different methods:

  • Used the json codec and the json_lines codec in the s3 input, but this keeps reading the JSON line by line.
  • Used the plain codec, but this also keeps reading the JSON line by line.

An example of the JSON data is below (Winlogbeat data):

{
  "@timestamp": "2020-06-02T13:27:08.041Z",
  "@version": "1",
  "message": "The is the message field with details of the data in it. Does have a couple \n and \t in the message. Same data really as what is under the winlog field.",
  "tags": [
    "winlogbeat-input"
  ],
  "winlog": {
    "provider_guid": "XXXXXXXXXXXXX",
    "record_id": 31983869,
    "event_id": 5156,
    "process": {
      "pid": 4,
      "thread": {
        "id": 3764
      }
    },
    "opcode": "Info",
    "channel": "Security",
    "computer_name": "XXXXXXXXXXXXX",
    "event_data": {
      "SourceAddress": "XXXXXXXXXXXXX",
      "LayerName": "%%14611",
      "RemoteMachineID": "XXXXXXXXXXXXX",
      "SourcePort": "XXXXXXXXXXXXX",
      "ProcessID": "1936",
      "FilterRTID": "0",
      "Protocol": "XXXXXXXXXXXXX",
      "RemoteUserID": "XXXXXXXXXXXXX",
      "Direction": "%%14593",
      "DestAddress": "XXXXXXXXXXXXX",
      "LayerRTID": "XXXXXXXXXXXXX",
      "DestPort": "XXXXXXXXXXXXX",
      "Application": "XXXXXXXXXXXXX"
    },
    "version": 1,
    "keywords": [
      "Audit Success"
    ],
    "task": "XXXXXXXXXXXXX",
    "api": "wineventlog",
    "provider_name": "Microsoft-Windows-Security-Auditing"
  }
}

From my understanding of Logstash, using the json codec should parse this correctly, or failing that, the json filter could pull out the information using [message], [winlog], or even [winlog][event_data] as the source. But this doesn't seem to be the case.

Am I missing something?
All help would be greatly appreciated.
Thank you!

Each line of a file is treated as a separate event. If you want to combine lines use a multiline codec.
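
As a rough illustration of that suggestion, the sketch below attaches a multiline codec to the s3 input. It assumes every JSON object opens with a { in the first column of its own line, as in the sample above, and that no other line starts with {. The bucket name, region, and prefix are placeholders, not values from this thread.

```
input {
  s3 {
    bucket => "my-example-bucket"   # placeholder bucket name
    region => "eu-west-1"           # placeholder region
    prefix => "winlogbeat/"         # placeholder key prefix
    codec  => multiline {
      # Any line that does NOT start with "{" is appended to the
      # previous line, so a new event starts at each top-level "{".
      pattern => "^\{"
      negate  => true
      what    => "previous"
    }
  }
}
```

With this shape the combined text lands in the [message] field as one string, so it would still need a json filter (or similar) to turn it into fields.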

Hey @Badger,
Thank you for your reply!

Have been looking into this and have tried a couple of things. I have tried the multiline codec but it didn't work. Maybe it's just a matter of trying different patterns to see which one works?

I did some looking and found a cool post of yours about parsing JSON with the multiline codec (Parsing array of json objects with logstash and injesting to elastic). I tried this as well, but it didn't work.

Maybe the pattern should match on the different { and/or } characters? Or do you think it will be a matter of playing around with patterns using \n etc.?

If you want to consume the entire file as a single event then that is easy, you can just use a pattern that never matches. If your file contains multiple JSON objects then if it is pretty printed you can use ^} to find the end of the object. If it is not pretty printed you may not be able to find a pattern that works.
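
A minimal sketch of the "pattern that never matches" approach, assuming one JSON object per file. The sentinel string is arbitrary and hypothetical (anything that cannot appear at the start of a line will do), and the bucket and region are placeholders. The flush interval and line limit are illustrative values, not recommendations from this thread.

```
input {
  s3 {
    bucket => "my-example-bucket"   # placeholder bucket name
    region => "eu-west-1"           # placeholder region
    codec  => multiline {
      # No line starts with this sentinel, so with negate => true
      # every line is appended to the previous one and the whole
      # file becomes a single event.
      pattern => "^NEVER_MATCHES_SENTINEL"
      negate  => true
      what    => "previous"
      # Fallback: emit the buffered event if no new line arrives
      # within this many seconds.
      auto_flush_interval => 2
      # The codec caps events at 500 lines by default; raise the cap
      # if the files are longer than that.
      max_lines => 20000
    }
  }
}

filter {
  # Parse the reassembled text into fields such as [winlog][event_data].
  json { source => "message" }
}
```

For a file holding several pretty-printed objects, the same shape should work with pattern => "^}", negate => true, and what => "next", so that each event ends at a line beginning with }. That variant assumes only the outermost closing brace of each object sits in the first column; if nested closing braces also start there, events will be cut short.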

@Badger ahh, thank you very much!
Can't believe I overlooked this simple thing :sweat_smile:
Thank you for your help.
