Parsing human-readable log files with grok

We have the following sample log file that was created in a human-readable format. I'm interested in the lines shown in bold and want to capture the following event fields:

timestamp (i.e. 7/17/15 18:50:59:616 GMT)
status (i.e. success or failure)

Can someone help me write the grok filter to filter out the two fields, please?


                              TraceLogMessage :   NA: CurrencyExchangeRate: An unexpected error occured when retreiving the currency exchange file and processing it.

[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I ***********************************************************************
[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failure.
[7/17/15 18:50:59:617 GMT] 00000688 CurrencyExcha I ***********************************************************************
TraceLogMessage : NA: CurrencyExchangeRate: An unexpected error occured when retreiving the currency exchange file and processing it.
[7/18/15 18:50:59:616 GMT] 00000688 CurrencyExcha I ***********************************************************************
[7/18/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 success.
[7/18/15 18:50:59:617 GMT] 00000688 CurrencyExcha I ***********************************************************************
TraceLogMessage : NA: CurrencyExchangeRate: An unexpected error occured when retreiving the currency exchange file and processing it.
[7/19/15 18:50:59:616 GMT] 00000688 CurrencyExcha I ***********************************************************************
[7/19/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failure.
[7/19/15 18:50:59:617 GMT] 00000688 CurrencyExcha I ***********************************************************************

So everything's on a single line, i.e.

[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failure.

rather than

[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 
2015/07/17 18:50:59 failure.

or...?

Yes, everything is on one line, Magnus.

I was able to make a bit of progress on my own.
Given Input:
[7/17/15 18:50:59:616 GMT] 00000688 CurrencyExcha I Currency Exchange Rate File Retreival at 2015/07/17 18:50:59 failed.

Filter:
%{DATESTAMP:tslice} %{GREEDYDATA} %{WORD:status}

Output:
{
  "tslice": [
    [
      "7/17/15 18:50:59:616"
    ]
  ],
  "status": [
    [
      "failed"
    ]
  ]
}

I'm not exactly sure why it worked... :slight_smile: Not sure how GREEDYDATA decided to stop where it did.

This filter I have isn't specific enough. I want the filter to just capture the Success/Failures of the exchange rate retrieval and ignore everything else from that log. As it stands, it would parse any and every line that has a timestamp and that's not what I want.

Not sure how GREEDYDATA decided to stop where it did.

Because you said that there's a WORD at the end.
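If it helps, here's roughly what those stock grok patterns expand to (a sketch of the shipped definitions, not the exact regex Logstash compiles):

%{DATESTAMP:tslice}  ->  %{DATE}[- ]%{TIME}   # captured as tslice
%{GREEDYDATA}        ->  .*                   # grabs as much as it can, then backtracks
%{WORD:status}       ->  \b\w+\b              # the engine gives back the last word so this can match

GREEDYDATA first swallows the rest of the line, and the regex engine then backs off just far enough for the final WORD to match, which is why status ends up holding the last word.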

This filter I have isn't specific enough. I want the filter to just capture the Success/Failures of the exchange rate retrieval and ignore everything else from that log.

This should be close enough:

filter {
  grok {
    match => [
      "message",
      "^\[%{DATESTAMP:tslice} GMT\] \d+ CurrencyExcha I Currency Exchange Rate File Retreival %{GREEDYDATA} %{WORD:status}$"
    ]
  }
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}

Magnus - thank you very much. That works great.

Is _grokparsefailure thrown when the parser can't extract one of the fields in the expression? What else gets returned in [tags]?

The _grokparsefailure tag is added when none of the supplied expressions match the field they're supposed to match.
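If you'd rather use a tag name of your own, the grok filter's tag_on_failure option overrides the default. A minimal sketch along the lines of your existing filter (the tag name here is just an example):

filter {
  grok {
    match => [
      "message",
      "^\[%{DATESTAMP:tslice} GMT\] \d+ CurrencyExcha I Currency Exchange Rate File Retreival %{GREEDYDATA} %{WORD:status}$"
    ]
    tag_on_failure => ["_not_exchange_rate_line"]   # hypothetical tag name
  }
  if "_not_exchange_rate_line" in [tags] {
    drop { }
  }
}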

Thank you - lastly:

The config I now have is meant to feed only one index in Elasticsearch. We may need to feed other indices from the same log. Do I need to create separate config files and start separate Logstash instances for those, or can I somehow include the input/filter/output for those other indices in the same config file? Please -

Look into conditionals.
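For example, give each input its own type and route on it in the filter and output blocks. A minimal sketch of the idea (the second log file, type, and index name are hypothetical, just to show the shape):

input {
  file {
    path => "C:/logstash-1.5.3/inputs/SystemOut1.log"
    type => "sysout"
  }
  file {
    path => "C:/logstash-1.5.3/inputs/SomeOther.log"   # hypothetical second log
    type => "otherlog"                                  # hypothetical type
  }
}

output {
  if [type] == "sysout" {
    elasticsearch {
      host => "localhost"
      index => "sprint"
    }
  } else if [type] == "otherlog" {
    elasticsearch {
      host => "localhost"
      index => "other_index"                            # hypothetical index
    }
  }
}

The filter block can be wrapped in the same kind of if [type] == "..." conditionals so each log only runs through its own grok pattern.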

Thanks again -

My Logstash config file is working now; I'm parsing the input log file and extracting the right data, all good there. The event looks good in the rubydebug output from Logstash. The issue is that I've started redirecting my extracted events to Elasticsearch, and the event structure that arrives in ES is different from what I defined in my _mappings. Somehow Logstash is overriding the document properties.

Here is what it looks like in the Logstash console (all good):

{
       "message" => "[7/30/15 18:50:59:616 GMT] 00000688 CurrencyExcha I   Currency Exchange Rate File Retreival at 2015/07/30 18:50:59 success",
      "@version" => "1",
    "@timestamp" => "2015-08-06T16:13:38.100Z",
          "host" => "IBM-EN189AKEUJ4",
          "path" => "C:/logstash-1.5.3/inputs/SystemOut1.log",
          "type" => "sysout",
        "tslice" => "7/30/15 18:50:59:616",
        "status" => "success"
}

When it gets to ES it looks terrible - Logstash stuffs everything into the [_source]{message} array, seemingly ignoring the Mapping definition in ES for index 'sprint'... The doc looks like this in ES:

curl -XGET 'localhost:9200/sprint/sysout/AU8DyhBIErJTJM_bP4rF?pretty'
{
  "_index" : "sprint",
  "_type" : "sysout",
  "_id" : "AU8DyhBIErJTJM_bP4rF",
  "_version" : 1,
  "found" : true,
  "_source":{"message":"[7/30/15 18:50:59:616 GMT] 00000688 CurrencyExcha I   Currency Exchange Rate File Retreival at 2015/07/30 18:50:59 success","@version":"1","@timestamp":"2015-08-06T16:13:38.100Z","host":"IBM-EN189AKEUJ4","path":"C:/logstash-1.5.3/inputs/SystemOut1.log","type":"sysout","tslice":"7/30/15 18:50:59:616","status":"success"}
}

Here is the _mapping I've created for the 'sprint' index:

 curl -XGET "localhost:9200/sprint/_mappings?pretty"
{
  "sprint" : {
    "mappings" : {
      "sysout" : {
        "properties" : {
          "@timestamp" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "@version" : {
            "type" : "string"
          },
          "host" : {
            "type" : "string"
          },
          "message" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "status" : {
            "type" : "string"
          },
          "tslice" : {
            "type" : "date",
            "format" : "MM/dd/yy HH:mm:ss:SSS"
          },
          "type" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

I should also paste the current config file as it exists right now:

input { 
    # Log file
    file {
       path => "C:/logstash-1.5.3/inputs/SystemOut1.log"
       type => "sysout"
       #start_position => "beginning"
      }
      
      # Standard input file
      #stdin { }
}  

filter {
  grok {
    match => [
      "message",
      "^\[%{DATESTAMP:tslice} GMT\] %{GREEDYDATA} (%{WORD:status}(\.)?)$"
    ]
  }
  if "_grokparsefailure" in [tags] {
    drop { }
  }
}
    
output { 
    elasticsearch {
        host => "localhost"
        index => "sprint"
    }
    stdout { 
        codec => rubydebug 
    } 
}

Why can't I get the individual fields in sprint/sysout to be populated properly? All I really care about in that document are the tslice and status fields...

Everything is working fine. When you use the Get API the top-level properties of the returned document contain metadata (_id, _type, ...) and the document itself is found in the _source property. See the example in the documentation.
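If you only care about tslice and status, you can also ask the Get API for just those fields via source filtering. Something like this should work (the id is the one from your example):

curl -XGET 'localhost:9200/sprint/sysout/AU8DyhBIErJTJM_bP4rF?pretty&_source=tslice,status'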

I thought I was starting to understand this product....but I was wrong.

I initially created the sprint index by issuing the following command:
curl -XPUT 'localhost:9200/sprint/sysout/1?pretty' -d '
{
"tslice": "15/07/10 07:04:36:321",
"status": "failed"
}'

I didn't even specify message in my document definition properties initially (it was created dynamically by logstash along with these other fields). I did have tslice and status fields in the document at the root level, but they were not populated?! I don't get it.

Why are you saying that the tslice and status fields are missing? They're right there, on the line that begins with "_source", near the end.

I guess I expected them to be present at the document root, just where I defined them:

"sprint" : {
"mappings" : {
"sysout" : {
"properties" : {
"@timestamp" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"@version" : {
"type" : "string"
},
"host" : {
"type" : "string"
},
"message" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"status" : {
"type" : "string"
},
"tslice" : {
"type" : "date",
"format" : "MM/dd/yy HH:mm:ss:SSS"
},
"type" : {
"type" : "string"
}

But they are in the document root. It's the Get API that returns the original document in the _source field. If you run the contents of the _source field through a JSON pretty printer it looks like this:

{
    "message": "[7/30/15 18:50:59:616 GMT] 00000688 CurrencyExcha I   Currency Exchange Rate File Retreival at 2015/07/30 18:50:59 success",
    "@version": "1",
    "@timestamp": "2015-08-06T16:13:38.100Z",
    "host": "IBM-EN189AKEUJ4",
    "path": "C:/logstash-1.5.3/inputs/SystemOut1.log",
    "type": "sysout",
    "tslice": "7/30/15 18:50:59:616",
    "status": "success"
}

That's your original, the source document.

Huh... I see that now - sorry, I was being difficult. Is there a way to run the contents of _source through a JSON pretty printer in ES?! Or do you do that in something like JSONLint?

In this case I used jsonlint.com. Since you supplied ?pretty in the URL I would've expected ES to have pretty printed the document as well.
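If you'd rather stay on the command line, piping the response through a pretty printer also works, e.g. (assuming Python is installed on the machine):

curl -s 'localhost:9200/sprint/sysout/AU8DyhBIErJTJM_bP4rF' | python -m json.tool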

Ok - thanks for your patience and advice.