I've been searching for this and keep being directed to the same page. I am trying to write grok patterns for my custom logs. For example, below is a log line:
Now, most of my searches lead me to this link, but I can't understand it or how to use it. Is there a tutorial that explains exactly what type of data each defined pattern matches? For example, I can't understand what NOTSPACE or GREEDYDATA are for, and many more. I also can't figure out how to grok 5E82093B:7550_B0092619:01BB_5E0DAC6F_33A27FC:05AD from the above log.
P.S. I am aware of the grok debugger (this), and I wrote the pattern below using it, just to give an idea of where I am at.
Start by matching the first field on the line. Do not try to match anything more than that. I never use grok debuggers; I use logstash itself. Start a copy of logstash with
--config.reload.automatic
enabled. That way you only pay the startup cost once; logstash will reload the configuration and re-run the pipeline each time you modify it. I would start with
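The original snippet does not appear here; a minimal sketch, reusing the MYDATETIME pattern definition that appears later in this thread and matching only the timestamp, would be:

grok {
    pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
    match => { "message" => "^%{MYDATETIME:[@metadata][timestamp]}" }
}

Running this with output { stdout { codec => rubydebug { metadata => true } } } makes it easy to see whether the field was extracted (the metadata option is needed because [@metadata] fields are hidden by default).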
I am not aware of a pattern that ships with logstash that matches your date/time format, so I defined one myself. Once that works, edit the configuration (in another window) and write it out. logstash will process it.
Note that I anchor the pattern using ^, so it has to match at the start of the line. Read this to understand why.
Once you have that working, start adding fields. You should end up with something like
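A sketch of what that might look like, assembled from the pieces discussed later in this thread (NOTSPACE and GREEDYDATA here are the generic catch-all patterns the question asks about, later tightened up):

grok {
    pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
    match => { "message" => "^%{MYDATETIME:[@metadata][timestamp]} RequestID: %{NOTSPACE:uuid} - URL: %{NOTSPACE:url} - %{GREEDYDATA:rest}" }
}
date { match => [ "[@metadata][timestamp]", "yyyy/MM/dd HH:mm:ss" ] }

A couple of things that often come up while adding fields: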
1 - If the amount or type of space (tab versus space) between two fields is variable, you can use \s+ (one or more characters that count as whitespace). If the space is sometimes missing, you can use \s* (zero or more).
2 - If a field is optional you can wrap it in ( and )? -- hard to say more without examples, but see the sketch below.
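For instance, a hypothetical pattern (the level, component, and flag field names here are invented) that tolerates variable whitespace and an optional trailing field:

grok {
    match => { "message" => "^%{WORD:level}\s+%{NOTSPACE:component}( flag=%{WORD:flag})?" }
}

The group ( flag=%{WORD:flag})? matches " flag=..." when it is present and nothing when it is absent.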
Timestamp, request_id, and URL are common to all lines, but query sometimes appears and sometimes does not, followed by processed in. I think these two cases can be covered using the same pattern, but I am not sure how.
When there is a common prefix to the lines, I would parse that first. Also, I would typically use dissect rather than grok to do it. dissect has limited functionality, but it is cheap. grok can do many, many things, and there are almost always multiple ways to write a grok filter, but that comes at a price: grok filters can be really expensive (there is a reason there is a 30 second timeout for matching a pattern). To do it with dissect I would use
dissect { mapping => { "message" => "%{[@metadata][timestamp]} %{+[@metadata][timestamp]} RequestID: %{requestId} - URL: %{url} - %{[@metadata][restOfLine]}" } }
date { match => [ "[@metadata][timestamp]", "yyyy/MM/dd HH:mm:ss" ] }
grok { match => { "[@metadata][restOfLine]" => "processed in <%{NUMBER:processingTime:float}>" } }
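For illustration, if a line looked like this (the date, URL, and timing here are invented; the request id is the one from the question):

2020/03/30 14:21:35 RequestID: 5E82093B:7550_B0092619:01BB_5E0DAC6F_33A27FC:05AD - URL: /api/items?id=42 - processed in <0.053>

then dissect would set [@metadata][timestamp] to "2020/03/30 14:21:35", requestId, url, and [@metadata][restOfLine] to "processed in <0.053>"; the date filter parses the timestamp into @timestamp, and the grok extracts processingTime as the float 0.053.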
Note that in the second grok the pattern is not anchored, but it starts with fixed text, so it will not be very expensive (no backtracking).
If you do not want to use dissect, you can achieve the same thing in grok. Note that I need to use alternation to capture the uri, because the host and protocol are missing on one of your lines.
grok {
    pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
    match => { "message" => "^%{MYDATETIME:[@metadata][timestamp]} RequestID: %{NOTSPACE:uuid} - URL: (%{URI:uri}|%{URIPATHPARAM:uri}) - %{GREEDYDATA:[@metadata][restOfLine]}" }
}
date { match => [ "[@metadata][timestamp]", "yyyy/MM/dd HH:mm:ss" ] }
grok { match => { "[@metadata][restOfLine]" => "processed in <%{NUMBER:processingTime:float}>" } }
Finally, if you need to capture the query parameters, and really do not want to use multiple groks, you can combine everything into one and use ( and )? to make the query field optional.
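A sketch of what that combined filter could look like (the literal query: label is a guess, since the exact log lines are not shown; this also assumes processed in is always present, otherwise that part can be wrapped in ( and )? as well):

grok {
    pattern_definitions => { "MYDATETIME" => "%{YEAR}/%{MONTHNUM}/%{MONTHDAY} %{TIME}" }
    match => { "message" => "^%{MYDATETIME:[@metadata][timestamp]} RequestID: %{NOTSPACE:uuid} - URL: (%{URI:uri}|%{URIPATHPARAM:uri})( - query: %{NOTSPACE:query})? - processed in <%{NUMBER:processingTime:float}>" }
}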