Parsing human readable log files with grok

jjdepaul · August 4, 2015, 5:13pm

We have the following log sample file that was created in a human readable format. I'm interested in the lines in bold and want to capture the following event fields:

timestamp (i.e. 7/17/15 18:50:59:616 GMT)
status (i.e. success or failure)

Can someone help me write the grok filter to filter out the two fields, please?

                              TraceLogMessage :   NA: CurrencyExchangeRate: An unexpected error occured when retreiving the currency exchange file and processing it.

[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I ***********************************
[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failure.
[7/17/15 18:50:59:617 GMT] 00000688 CurrencyExcha I *
TraceLogMessage : NA: CurrencyExchangeRate: An unexpected error occured when retreiving the currency exchange file and processing it.
[7/18/15 18:50:59:616 GMT] 00000688 CurrencyExcha I *
[7/18/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 success.
[7/18/15 18:50:59:617 GMT] 00000688 CurrencyExcha I *
TraceLogMessage : NA: CurrencyExchangeRate: An unexpected error occured when retreiving the currency exchange file and processing it.
[7/19/15 18:50:59:616 GMT] 00000688 CurrencyExcha I *
[7/19/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failure.
[7/19/15 18:50:59:617 GMT] 00000688 CurrencyExcha I ***********************************

magnusbaeck · August 4, 2015, 6:11pm

So everything's on a single line, i.e.

[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failure.

rather than

[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 
2015/07/17 18:50:59 failure.

or...?

jjdepaul · August 4, 2015, 6:18pm

Yes, everything is on one line, Magnus.

jjdepaul · August 4, 2015, 6:22pm

I was able to make a bit of progress on my own.
Given Input:
[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failed.

Filter:
%{DATESTAMP:tslice} %{GREEDYDATA} %{WORD:status}

Output:
{
"tslice": [
[
"7/17/15 18:50:59:616"
]
],
"status": [
[
"failed"
]
]
}

I'm not exactly sure why it worked... Not sure how GREDYDATA decided to stop where it did.

This filter I have isn't specific enough. I want the filter to just capture the Success/Failures of the exchange rate retrieval and ignore everything else from that log. As it stands, it would parse any and every line that has a timestamp and that's not what I want.

magnusbaeck · August 4, 2015, 8:18pm

Not sure how GREDYDATA decided to stop where it did.

Because you said that there's a WORD at the end.

This filter I have isn't specific enough. I want the filter to just capture the Success/Failures of the exchange rate retrieval and ignore everything else from that log.

This should be close enough:

filter {
  grok {
    match => [
      "message",
      "^\[%{DATESTAMP:tslice} GMT\] \d+ CurrencyExcha I Currency Exchange Rate File Retreival %{GREEDYDATA} %{WORD:status}$"
    ]
  }
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}

jjdepaul · August 5, 2015, 1:59pm

Magnus - thank you very much. That works great.

Is _grokparserfailure thrown when the parser can't extract one of the fields in the expression? What all is returned in [tags]?

magnusbaeck · August 5, 2015, 2:02pm

The _grokparsefailure tag is added when none of the supplied expressions match the field they're supposed to match.

jjdepaul · August 5, 2015, 2:13pm

Thank you - lastly:

This config I now have is meant to feed only one Index type in ElasticSearch. We may need to feed other Indices from the same log. Do I need to create different config files and start separate LogStash instances for those? Or do I somehow include Input/Filter/Output for those other indices all in the same config file? Please -

magnusbaeck · August 5, 2015, 2:18pm

Look into conditionals.

jjdepaul · August 5, 2015, 2:21pm

Thanks again -

jjdepaul · August 6, 2015, 5:17pm

My logstash config file is working now, I'm parsing the input log file and extracting the right data, all good here. The event looks good in the RUBYDEBUG from logstash. The issue I'm having is that I started redirecting my extracted events to Elasticsearch and now event structure that I get into ES is different than what I defined it using _Mappings definition. Somehow logstash is overriding the document properties.

Here is what it looks like in logstash console (all good):

{
       "message" => "[7/30/15 18:50:59:616 GMT] 00000688 CurrencyExcha I   Currency Exchange Rate File Retreival at 2015/07/30 18:50:59 success",
      "@version" => "1",
    "@timestamp" => "2015-08-06T16:13:38.100Z",
          "host" => "IBM-EN189AKEUJ4",
          "path" => "C:/logstash-1.5.3/inputs/SystemOut1.log",
          "type" => "sysout",
        "tslice" => "7/30/15 18:50:59:616",
        "status" => "success"
}

When it gets to ES it looks terrible - Logstash stuffs everything into the [_source]{message} array, seemingly ignoring the Mapping definition in ES for index 'sprint'... The doc looks like this in ES:

curl -XGET 'localhost:9200/sprint/sysout/AU8DyhBIErJTJM_bP4rF?pretty'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   471  100   471    0     0  31400      0 --:--:-- --:--:-- --:--:-- 31400{
  "_index" : "sprint",
  "_type" : "sysout",
  "_id" : "AU8DyhBIErJTJM_bP4rF",
  "_version" : 1,
  "found" : true,
  "_source":{"message":"[7/30/15 18:50:59:616 GMT] 00000688 CurrencyExcha I   Currency Exchange Rate File Retreival at 2015/07/30 18:50:59 success","@version":"1","@timestamp":"2015-08-06T16:13:38.100Z","host":"IBM-EN189AKEUJ4","path":"C:/logstash-1.5.3/inputs/SystemOut1.log","type":"sysout","tslice":"7/30/15 18:50:59:616","status":"success"}
}

Here is my _Mapping that I've created for 'sprint' index are as follows:

 curl -XGET "localhost:9200/sprint/_mappings?pretty"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   760  100   760    0     0  47500      0 --:--:-- --:--:-- --:--:-- 47500{
  "sprint" : {
    "mappings" : {
      "sysout" : {
        "properties" : {
          "@timestamp" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "@version" : {
            "type" : "string"
          },
          "host" : {
            "type" : "string"
          },
          "message" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "status" : {
            "type" : "string"
          },
          "tslice" : {
            "type" : "date",
            "format" : "MM/dd/yy HH:mm:ss:SSS"
          },
          "type" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

I should also paste the current config file as it exists right now:

input { 
    # Log file
    file {
       path => "C:/logstash-1.5.3/inputs/SystemOut1.log"
       type => "sysout"
       #start_position => "beginning"
      }
      
      # Standard input file
      #stdin { }
}  

filter {
  grok {
    match => [
      "message",
      "^\[%{DATESTAMP:tslice} GMT\] %{GREEDYDATA} (%{WORD:status}(\.)?)$"
    ]
  }
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}
    
output { 
    elasticsearch {
        host => localhost
        index => "sprint"
    }
    stdout { 
        codec => rubydebug 
    } 
}

Why can't I get the individual fields in sprint/sysout to get populatedproperly? All I really care about in that Document are fields tslice and status...

magnusbaeck · August 6, 2015, 5:36pm

Everything is working fine. When you use the Get API the top-level properties of the returned document contain metadata (_id, _type, ...) and the document itself is found in the _source property. See the example in the documentation.

jjdepaul · August 6, 2015, 6:49pm

I thought I was starting to understand this product....but I was wrong.

I initially created the sprint index by issuing the following command:
curl -XPUT 'localhost:9200/sprint/sysout/1?pretty' -d '
{
"tslice": "15/07/10 07:04:36:321",
"status": "failed"
}'

I didn't even specify message in my document definition properties initially (it was created dynamically by logstash along with these other fields). I did have tslice and status fields in the document at the root level, but they were not populated?! I don't get it.

magnusbaeck · August 6, 2015, 6:53pm

Why are you saying that the tslice and status fields are missing? They're right there, on the line that begins with "_source", near the end.

jjdepaul · August 6, 2015, 6:55pm

I guess I expected them to be present at the document root, just where I defined them:

"sprint" : {
"mappings" : {
"sysout" : {
"properties" : {
"@timestamp" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"@version" : {
"type" : "string"
},
"host" : {
"type" : "string"
},
"message" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"status" : {
"type" : "string"
},
"tslice" : {
"type" : "date",
"format" : "MM/dd/yy HH:mm:ss:SSS"
},
"type" : {
"type" : "string"
}

magnusbaeck · August 6, 2015, 6:58pm

But they are in the document root. It's the Get API that returns the original document in the _source field. If you run the contents of the _source field through a JSON pretty printer it looks like this:

{
    "message": "[7/30/15 18:50:59:616 GMT] 00000688 CurrencyExcha I   Currency Exchange Rate File Retreival at 2015/07/30 18:50:59 success",
    "@version": "1",
    "@timestamp": "2015-08-06T16:13:38.100Z",
    "host": "IBM-EN189AKEUJ4",
    "path": "C:/logstash-1.5.3/inputs/SystemOut1.log",
    "type": "sysout",
    "tslice": "7/30/15 18:50:59:616",
    "status": "success"
}

That's your original, the source document.

jjdepaul · August 6, 2015, 7:01pm

Huh... I see that now - sorry, I was being difficult. Is there a way to run contents of _source through JSON pretty print in ES?! Or do you do that in something like JSONLint?

magnusbaeck · August 6, 2015, 7:13pm

In this case I used jsonlint.com. Since you supplied ?pretty in the URL I would've expected ES to have pretty printed the document as well.

jjdepaul · August 6, 2015, 9:05pm

Ok - thanks for your patience and advice.

Topic		Replies	Views
Grok filter not extracting fields Logstash	28	3230	May 14, 2018
Some fields missing Logstash	6	1976	July 6, 2017
Grok for parsing java Log Logstash	9	23077	July 6, 2017
Grok in logstash.conf doesnt work! Logstash	6	462	June 4, 2018
Lumberjack loosing grokked fields? Elasticsearch	1	329	July 6, 2017

Parsing human readable log files with grok

Related topics