Logstash configuration for CloudFront logs

Hi guys,

Please help me get CloudFront logs parsed in Logstash. I want each field to be searchable, including query parameters like aid, bid, cid, etc. See the samples below.

Sample CloudFront log

2016-03-29 04:02:08 ABC1 461 22.20.17.8 GET afsaGdhfxghxgh.cloudfront.net /1.gif - Mozilla/5.0%2520(Linux;%2520Android%25205.1.1;%2520SM-G920I%2520Build/LMY47X;%2520wv)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Version/4.0%2520Chrome/48.0.2564.106%2520Mobile%2520Safari/537.36 aid=fsdggg25346&bid=fsdgagsexfdhg&cid=1423690744601076&cb=fdfsdggg&did=fsagdsgg&eid=fDSGzsgdfhdsh - Miss jAk9duSOoOPVfssDGZdfhgxxxghfghpzK35tRuujwuQ== afsaGdhfxghxgh.cloudfront.net https 558 0.715 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss

This one works, but I want the query parameters to be searchable as well:

match => { "message" => "%{DATE_EU:date}\t%{TIME:time}\t%{WORD:x_edge_location}\t(?:%{NUMBER:sc_bytes}|-)\t%{IPORHOST:c_ip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status}\t%{GREEDYDATA:referrer}\t%{GREEDYDATA:User_Agent}\t%{GREEDYDATA:cs_uri_stem}\t%{GREEDYDATA:cookies}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes}\t%{GREEDYDATA:time_taken}\t%{GREEDYDATA:x_forwarded_for}\t%{GREEDYDATA:ssl_protocol}\t%{GREEDYDATA:ssl_cipher}\t%{GREEDYDATA:x_edge_response_result_type}" }

This one does not work:

match => { "message" => "%{DATE_EU:date}\t%{TIME:time}\t%{WORD:x_edge_location}\t(?:%{NUMBER:sc_bytes}|-)\t%{IPORHOST:c_ip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status}\t%{GREEDYDATA:referrer}\t%{GREEDYDATA:User_Agent}\t(?[A-Za-z0-9$.+!'|(){},~@#%&/=:;_?-[]<>^`])?)?)\t%{GREEDYDATA:cookies}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes}\t%{GREEDYDATA:time_taken}\t%{GREEDYDATA:x_forwarded_for}\t%{GREEDYDATA:ssl_protocol}\t%{GREEDYDATA:ssl_cipher}\t%{GREEDYDATA:x_edge_response_result_type}" }

Logstash Configuration

input {
  file {
    path => "/opt/cloudfront/E2I53NO2J8KEJZ*"
    type => "cloudfront"
    start_position => "beginning"
    sincedb_path => "log_sincedb"
  }
}

filter {
  if [type] == "cloudfront" {
    if (("#Version: 1.0" in [message]) or ("#Fields: date" in [message])) {
      drop {}
    }

    grok {
      match => { "message" => "%{DATE_EU:date}\t%{TIME:time}\t%{WORD:x_edge_location}\t(?:%{NUMBER:sc_bytes}|-)\t%{IPORHOST:c_ip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri_stem}\t%{NUMBER:sc_status}\t%{GREEDYDATA:referrer}\t%{GREEDYDATA:User_Agent}\t(<params>\?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>\^\`]*)?)?)\t%{GREEDYDATA:cookies}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes}\t%{GREEDYDATA:time_taken}\t%{GREEDYDATA:x_forwarded_for}\t%{GREEDYDATA:ssl_protocol}\t%{GREEDYDATA:ssl_cipher}\t%{GREEDYDATA:x_edge_response_result_type}" }
    }
  }

  mutate {
    add_field => [ "received_at", "%{@timestamp}" ]
    add_field => [ "listener_timestamp", "%{date} %{time}" ]
  }

  date {
    match => [ "listener_timestamp", "yy-MM-dd HH:mm:ss" ]
  }

  if [params] {
    mutate {
      rename => { "params" => "params[request]" }
    }
    urldecode {
      field => "params[request]"
    }
    kv {
      source => "params[request]"
      field_split => "?&"
      target => "params"
    }
    ruby {
      code => "
        arguments = Array.new
        event['params'].to_hash.each { |k, v|
          if k == 'request'
            next
          end
          arguments << { 'key' => k, 'value' => v }
        }
        unless arguments.empty?
          event['[arguments]'] = arguments
        end
      "
      remove_field => [ "params" ]
    }
  }
}

output {
  stdout { codec => rubydebug }
}

So it's the cs_uri_stem field that contains the data you want to parse further? Keep the original expression that works and use a kv filter to parse cs_uri_stem.
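
Something along these lines should do it (a minimal sketch; cs_uri_stem is the field name from your working grok expression, and the params target name is just an example):

kv {
  source => "cs_uri_stem"   # the query string field from your grok expression
  field_split => "&"
  value_split => "="
  target => "params"        # example target name; pick whatever suits you
}

That turns aid=fsdggg25346&bid=fsdgagsexfdhg&... into separate params.aid, params.bid, ... fields.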

And try to avoid having multiple GREEDYDATA patterns in the same expression. It might seem to work, but it can easily blow up later. If the fields are tab-delimited, why not use the csv filter to extract them instead of grok?
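
For example (a sketch, not tested; the column names mirror your grok field names, with the duplicate cs_uri_stem renamed to cs_uri_query, and the separator has to be a literal tab character typed directly, since \t is not interpreted inside config strings):

csv {
  separator => "	"  # a literal tab character
  columns => [
    "date", "time", "x_edge_location", "sc_bytes", "c_ip", "cs_method",
    "cs_host", "cs_uri_stem", "sc_status", "referrer", "User_Agent",
    "cs_uri_query", "cookies", "x_edge_result_type", "x_edge_request_id",
    "x_host_header", "cs_protocol", "cs_bytes", "time_taken",
    "x_forwarded_for", "ssl_protocol", "ssl_cipher",
    "x_edge_response_result_type"
  ]
}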

Yes, cs_uri_stem contains the data that I want to parse, but if you look at my configuration above you will find that there are two cs_uri_stem fields. So I renamed one cs_uri_stem to cs_uri and used kv as you suggested (see below), but after the logs are loaded into Elasticsearch I am not able to search the parameters, for example aid="xxxxxxxx" AND bid="yyyyyyyyyyy".

input {
  file {
    path => "/opt/cloudfront/E2I53NO2J8KEJZ*"
    type => "cloudfront"
    start_position => "beginning"
    sincedb_path => "log_sincedb"
  }
}

filter {
  if [type] == "cloudfront" {
    if (("#Version: 1.0" in [message]) or ("#Fields: date" in [message])) {
      drop {}
    }

    grok {
      match => { "message" => "%{DATE_EU:date}\t%{TIME:time}\t%{WORD:x_edge_location}\t(?:%{NUMBER:sc_bytes}|-)\t%{IPORHOST:c_ip}\t%{WORD:cs_method}\t%{HOSTNAME:cs_host}\t%{NOTSPACE:cs_uri}\t%{NUMBER:sc_status}\t%{GREEDYDATA:referrer}\t%{GREEDYDATA:User_Agent}\t%{GREEDYDATA:cs_uri_stem}\t%{GREEDYDATA:cookies}\t%{WORD:x_edge_result_type}\t%{NOTSPACE:x_edge_request_id}\t%{HOSTNAME:x_host_header}\t%{URIPROTO:cs_protocol}\t%{INT:cs_bytes}\t%{GREEDYDATA:time_taken}\t%{GREEDYDATA:x_forwarded_for}\t%{GREEDYDATA:ssl_protocol}\t%{GREEDYDATA:ssl_cipher}\t%{GREEDYDATA:x_edge_response_result_type}" }
    }
  }

  mutate {
    add_field => [ "received_at", "%{@timestamp}" ]
    add_field => [ "listener_timestamp", "%{date} %{time}" ]
  }

  date {
    match => [ "listener_timestamp", "yy-MM-dd HH:mm:ss" ]
  }

  if [cs_uri_stem] {
    mutate {
      rename => { "cs_uri_stem" => "cs_uri_stem[request]" }
    }
    urldecode {
      field => "cs_uri_stem[request]"
    }
    kv {
      source => "cs_uri_stem[request]"
      field_split => "?&"
      target => "cs_uri_stem"
    }
    ruby {
      code => "
        arguments = Array.new
        event['cs_uri_stem'].to_hash.each { |k, v|
          if k == 'request'
            next
          end
          arguments << { 'key' => k, 'value' => v }
        }
        unless arguments.empty?
          event['[arguments]'] = arguments
        end
      "
      remove_field => [ "cs_uri_stem" ]
    }
  }
}

output {
  stdout { codec => rubydebug }
}

Do you think csv would be better than grok in this use case?

but after the logs are loaded into Elasticsearch I am not able to search the parameters, for example aid="xxxxxxxx" AND bid="yyyyyyyyyyy"

What do the resulting events look like? Please show the output of stdout { codec => rubydebug }.

{
    "message" => "2016-03-29\t04:02:08\tABC1\t461\t22.20.17.8\tGET\tafsaGdhfxghxgh.cloudfront.net\t/1.gif\t-\tMozilla/5.0%2520(Linux;%2520Android%25205.1.1;%2520SM-G920I%2520Build/LMY47X;%2520wv)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Version/4.0%2520Chrome/48.0.2564.106%2520Mobile%2520Safari/537.36\taid=fsdggg25346&bid=fsdgagsexfdhg&cid=1423690744601076&cb=fdfsdggg&did=fsagdsgg&eid=fDSGzsgdfhdsh\t\t Miss\tjAk9duSOoOPVfssDGZdfhgxxxghfghpzK35tRuujwuQ==\tafsaGdhfxghxgh.cloudfront.net\thttps\t558\t0.715\t\t TLSv1.2\tECDHE-RSA-AES128-GCM-SHA256\tMiss",
    "@version" => "1",
    "@timestamp" => "2016-03-29T01:04:12.000Z",
    "path" => "/opt/cloudfront/E2I53NO2J8KEJZ",
    "host" => "localhost",
    "type" => "cloudfront",
    "date" => "16-03-27",
    "time" => "01:04:12",
    "x_edge_location" => "ABC1",
    "sc_bytes" => "461",
    "c_ip" => "22.20.17.8",
    "cs_method" => "GET",
    "cs_host" => "afsaGdhfxghxgh.cloudfront.net",
    "cs_uri" => "/1.gif",
    "sc_status" => "200",
    "referrer" => "-",
    "User_Agent" => "Mozilla/5.0%2520(Linux;%2520U;%2520Android%25204.2.2;%2520en-gb;%2520SM-T110%2520Build/JDQ39)%2520AppleWebKit/534.30%2520(KHTML,%2520like%2520Gecko)%2520Version/4.0%2520Safari/534.30",
    "cookies" => "-",
    "x_edge_result_type" => "Miss",
    "x_edge_request_id" => "jAk9duSOoOPVfssDGZdfhgxxxghfghpzK35tRuujwuQ==",
    "x_host_header" => "afsaGdhfxghxgh.cloudfront.net",
    "cs_protocol" => "https",
    "cs_bytes" => "581",
    "time_taken" => "0.042",
    "x_forwarded_for" => "-",
    "ssl_protocol" => "TLSv1",
    "ssl_cipher" => "ECDHE-RSA-AES128-SHA",
    "x_edge_response_result_type" => "Miss",
    "received_at" => "2016-03-29T10:16:46.815Z",
    "listener_timestamp" => "16-03-27 01:04:12",
    "arguments" => [
        [0] {
            "key" => "aid",
            "value" => "432432546376879869"
        },
        [1] {
            "key" => "bid",
            "value" => "reawca54rsyxdfhgtf"
        },
        [2] {
            "key" => "cid",
            "value" => "gzsdfhbxdfhx35q"
        },
        [3] {
            "key" => "did",
            "value" => "35434w65474ew"
        },
        [4] {
            "key" => "eid",
            "value" => "43r536w456"
        }
    ]
}

After loading this log, I see that the "arguments" field is not indexed and hence not searchable.

You probably don't want arguments to be an array of objects. Searches are not going to work like you expect them to. Instead, I suggest you aim for

"arguments": {
  "aid": "432432546376879869",
  "bid": "reawca54rsyxdfhgtf",
  ...
}

which is what the kv filter should give you out of the box.
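
In other words, something like this instead of the ruby filter (a sketch; it assumes the urldecoded query string ends up in a field of its own, here called cs_uri_query):

kv {
  source => "cs_uri_query"  # assumed name of the urldecoded query string field
  field_split => "&"
  target => "arguments"
}

Each parameter then becomes a subfield, so in Kibana you can query e.g. arguments.aid:"432432546376879869" AND arguments.bid:"reawca54rsyxdfhgtf".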

I used just the kv filter and it works as expected, but it looks like when searching in Kibana the count of log lines and the count of documents in ES are different. Is it because I am using the kv filter? Do we have any alternative to the kv filter to achieve what you described in your last response?

You have to be more specific than "when searching in Kibana the count of log lines and the count of documents in ES are different".

I don't think the kv filter has anything to do with this.