Parsing mail from (AWS) S3

Jackson_Pollock · June 26, 2017, 5:09am

I have emails (text files) in S3 that are being forwarded via SES. These are simple text files, each containing a single email. My Logstash config is below:

input {
  s3 {
    access_key_id => "xxxxxxx"
    secret_access_key => "dxxxxxx"
    bucket => "fs-mailit"
    type => "s3-access"
    region => "us-west-2"
    sincedb_path => "/tmp/mailit-logs"
    interval => 120
    delete => true
  }
}

output {
  stdout { codec => rubydebug }
}

It works, but reads one line at a time like:

{
    "@timestamp" => 2017-06-26T04:31:22.584Z,
      "@version" => "1",
       "message" => "Return-Path: jackson@gmail.com\r\n",
          "type" => "s3-access"
}
{
    "@timestamp" => 2017-06-26T04:31:22.590Z,
      "@version" => "1",
       "message" => "Received: from mail-qt0-f169.google.com (mail-qt0-f169.google.com [209.85.216.169])\r\n",
          "type" => "s3-access"
}

And therein lies the issue. I would like to parse out the usual items (from, to, subject, data) and create a single document in Elasticsearch.

Looking for your advice on the best way to do this:
Any way to make s3 slurp the entire file into json rather than one line at a time?

Maybe add a unique identifier that is persistent across the entire file?

A nice email reader plugin would be really helpful so I don't have to manually grok each line.

Many thanks!
-Steve

guyboertje · June 26, 2017, 8:47am

The S3 input is line oriented (in the code: read_file(filename) do |line|); it's designed to read log lines from files stashed in S3.

The multiline codec is designed to accumulate + assemble those lines and, based on some rules, will join lines into a larger string of text to put into each event. This may be of use to you.

Your problem will be in configuring the rules. You need to know which characters mark the beginning or end of each email in a "stream" of lines (for this, imagine that all your email files in S3 were concatenated into one very big file). The start or end marker can also be specified as a regular expression, and it will be consumed, i.e. it will not appear in the assembled text.
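
As a rough starting point, here is a minimal sketch of such a rule. It assumes that every stored email begins with a Return-Path: header, as in the sample output above; that is only a guess from two events, so check it against real files. The max_lines value and the grok patterns for From, To and Subject are illustrative assumptions, and the patterns do not handle headers folded across several lines.

```
input {
  s3 {
    access_key_id => "xxxxxxx"
    secret_access_key => "dxxxxxx"
    bucket => "fs-mailit"
    region => "us-west-2"
    sincedb_path => "/tmp/mailit-logs"
    interval => 120
    codec => multiline {
      # Assumption: a line starting with "Return-Path:" opens a new email.
      pattern => "^Return-Path:"
      # Any line that does NOT match the pattern is appended to the previous one.
      negate => true
      what => "previous"
      # Assumption: raised from the default of 500 to allow for longer emails.
      max_lines => 10000
    }
  }
}

filter {
  grok {
    # Try every pattern rather than stopping at the first match.
    break_on_match => false
    match => {
      "message" => [
        "^From: (?<mail_from>[^\r\n]*)",
        "^To: (?<mail_to>[^\r\n]*)",
        "^Subject: (?<mail_subject>[^\r\n]*)"
      ]
    }
  }
}

output {
  stdout { codec => rubydebug }
}
```

The delete option is left out on purpose so the bucket is untouched while experimenting. If events come out tagged multiline_codec_max_lines_reached, the limit was hit and the email was split. The field names mail_from, mail_to and mail_subject are made up for this example.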

If you post three or four redacted emails here then maybe I can help. Hint: choose very short emails and use triple backticks on a new line to top and tail each email when you post them.

system · July 24, 2017, 8:47am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
