Help needed with grok

I've been searching for this and am directed to the same page most of the time. I am trying to write grok patterns for my custom logs. For example, here is a log line:

2020/01/02 08:40:16 UUID: 5E82093B:7550_B0092619:01BB_5E0DAC6F_33A27FC:05AD - URL: https://endpoint.point/path/to/api 0.011636824 elapsed(s)

Now, most of my searches lead me to this link, but I can't understand it or how to use it. Is there a tutorial that explains exactly what type of data each defined pattern matches? For instance, I can't understand what NOTSPACE or GREEDYDATA are for, and many more. I also can't work out how to grok 5E82093B:7550_B0092619:01BB_5E0DAC6F_33A27FC:05AD from the log above.

P.S. I am aware of the grok debugger (this). I wrote the pattern below using it, just to give an idea of where I am at.

%{IPORHOST:clientip} - %{DATA:somedata} \[%{HTTPDATE:timestamp}\] "%{WORD:verb} %{URIPATH:request} HTTP/(?<httpversion>[0-9.]*)" %{NUMBER:response_code} %{NUMBER:bytes_transferred} %{DATA:referrer} "%{DATA:agent_info}"rt=%{NUMBER:request_time} uct="%{NUMBER:upstream_connect_time}" uht="%{NUMBER:upstream_header_time}" urt="%{NUMBER:upstream_response_time}"

I have some specific questions as well; answers to these would be helpful:

  1. If the space between two fields is unpredictable, how do I manage that?
  2. If a field or two in a log line is not always present, how do I ensure its index field is created in Kibana when it does appear?
  3. How do I grok the UUID above?

Start by matching the first field on the line. Do not try to match anything more than that. I never use grok debuggers; I use logstash. Start a copy of logstash with

--config.reload.automatic

enabled. That way you only pay the startup cost once, and it will reload the configuration and reinvoke the pipeline each time you modify the configuration. I would start with

input { generator { count => 1 lines => [ '2020/01/02 08:40:16 UUID: 5E82093B:7550_B0092619:01BB_5E0DAC6F_33A27FC:05AD - URL: https://endpoint.point/path/to/api 0.011636824 elapsed(s)' ] } }
filter {
    grok {
        pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
        match => { "message" => "^%{MYDATETIME:time} " }
    }
}
output { stdout { codec => rubydebug { metadata => false } } }

I am not aware of a pattern that ships with logstash that matches your date/time format, so I defined one myself. Once that works, edit the configuration (in another window) and save it. logstash will process it again.
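
If you also want the event's @timestamp set from that field, a date filter can convert it (a minimal sketch, assuming the field is named time as in the pattern above):

date { match => [ "time", "yyyy/MM/dd HH:mm:ss" ] }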

Note that I anchor the pattern using ^, so it has to match at the start of the line. Read this to understand why.

Once you have that working, start adding fields. You should end up with something like

match => { "message" => "^%{MYDATETIME:time} UUID: %{NOTSPACE:uuid} - URL: %{URI:uri} %{NUMBER:elapsed:float} " }
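
That is, the whole filter, combining the custom pattern definition with the full match (nothing new here, just the pieces from above put together):

filter {
    grok {
        pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
        match => { "message" => "^%{MYDATETIME:time} UUID: %{NOTSPACE:uuid} - URL: %{URI:uri} %{NUMBER:elapsed:float} " }
    }
}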

To answer your specific questions:

1 - If the amount or type of space (tab versus space) between two fields is variable, you can use \s+ (that's one or more characters that count as whitespace). If the space is sometimes missing, you can use \s* (zero or more). See the sketch after these answers.

2 - If a field is optional you can wrap it in ( and )? (also shown in the sketch below) -- hard to say more without examples.

3 - I used NOTSPACE to capture the UUID.
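
To illustrate 1 and 2 against your sample line, reusing the MYDATETIME definition from earlier (a hypothetical sketch; the \s+ and the optional group are only there for demonstration, your line may not need them):

match => { "message" => "^%{MYDATETIME:time}\s+UUID:\s+%{NOTSPACE:uuid}( - URL: %{URI:uri})?" }

The \s+ tolerates any run of whitespace between fields, and wrapping the URL part in ( and )? means the pattern still matches lines where it is missing.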


Well, this helps a lot. I've been struggling, man. Thanks.
Here's an example for the second question. These are two different log lines:

2019/12/06 06:45:07 RequestID: werq5-wetfe4-3eqwetq353r4-QWR3a - URL: http://path/to/end/point - processed in <0.226>(s)

2019/12/06 06:45:07 RequestID: 4q534-q4t43t4-43545q24tq43-q3f4 - URL: /some/path/to/api - Query:some=2&values=5&variables_query - processed in <0.226>(s)

Timestamp, request_id, and URL are common to both, but Query sometimes appears and sometimes does not, followed by processed in. I think these two can be covered using the same pattern, but I'm not sure how.

When there is a common prefix to the lines, I would parse that first. Also, I would typically use dissect rather than grok to do it. dissect has limited functionality, but it is cheap. grok can do many, many things, and there are almost always multiple ways to write a grok filter, but that comes at a price. grok filters can be really expensive (there is a reason there is a 30-second timeout for matching a pattern). To do it with dissect I would use

dissect { mapping => { "message" => "%{[@metadata][timestamp]} %{+[@metadata][timestamp]} RequestID: %{requestId} - URL: %{url} - %{[@metadata][restOfLine]}" } }
date { match => [ "[@metadata][timestamp]", "yyyy/MM/dd HH:mm:ss" ] }
grok { match => { "[@metadata][restOfLine]" => "processed in <%{NUMBER:processingTime:float}>" } }

Note that in the second grok the pattern is not anchored, but it starts with fixed text, so it will not be very expensive (no backtracking).

If you do not want to use dissect, you can achieve the same in grok. Note that I need to use alternation to capture the uri, because the host and protocol are missing on one of your lines.

grok {
    pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
    match => { "message" => "^%{MYDATETIME:[@metadata][timestamp]} RequestID: %{NOTSPACE:uuid} - URL: (%{URI:uri}|%{URIPATHPARAM:uri}) - %{GREEDYDATA:[@metadata][restOfLine]}" }
}
date { match => [ "[@metadata][timestamp]", "yyyy/MM/dd HH:mm:ss" ] }
grok { match => { "[@metadata][restOfLine]" => "processed in <%{NUMBER:processingTime:float}>" } }

Finally, if you need to capture the query parameters, and really do not want to use multiple groks, you can combine everything into one and use ( and )? to make the query field optional.

grok {
    pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
    match => { "message" => "^%{MYDATETIME:[@metadata][timestamp]} RequestID: %{NOTSPACE:uuid} - URL: (%{URI:uri}|%{URIPATHPARAM:uri}) - (Query:%{NOTSPACE:uriQuery} - )?processed in <%{NUMBER:processingTime:float}>" }
}
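
If you then want each query parameter as its own field, a kv filter can split the captured string (a sketch, assuming the uriQuery field from the pattern above):

kv { source => "uriQuery" field_split => "&" }

By default kv also splits each pair on =, so some=2&values=5 becomes fields named some and values.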

This worked for me. Thanks for your help. Cheers
