However, I can't change the format of the original log file to put quotes around that field. Can anybody advise how I can do this with a grok filter, please?
The easiest way is to look for unique "breakpoints" in the log line and build your pattern around them. For instance, those square brackets let you use GREEDYDATA to capture arbitrary strings while still keeping regex greed in check.
Try this one:
But I don't understand why it works! I could understand it if there were square brackets around the log_level (INFO), because that would end the GREEDYDATA capture, since log_level comes right after the timezone. As it is, I'd expect this expression to pick up timezone and log_level as one field?
I suppose I'm asking, why doesn't GREEDYDATA pick up the timezone as "Coordinated Universal Time INFO"? How does it know that they are two separate fields?
Because greediness relies heavily on backtracking. GREEDYDATA first captures as many characters as it can, then gives characters back from the end of the capture group, one at a time, until the rest of the pattern matches.
Here's a step-by-step walkthrough to visualize it better (I converted the grok patterns to pure regex):
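A minimal sketch of that backtracking in Python, in case it helps. The log fragment, the field names, and the simplified regex below are assumptions made for illustration, not the pattern from this thread. GREEDYDATA is defined as `.*` in grok, so the loop emulates it by starting from the longest possible capture and shrinking it until the rest of the pattern fits.

```python
import re

# Hypothetical log fragment (assumption, not the original log line).
LINE = "Coordinated Universal Time INFO [main] Service started"

# Everything that must follow the greedy timezone capture:
# a space, an upper-case log level, a space, then a bracketed thread name.
REST = re.compile(r" (?P<log_level>[A-Z]+) \[(?P<thread>[^\]]+)\]")


def greedy_backtrack(line):
    """Emulate a greedy `.*`: take everything, then give characters back."""
    attempts = 0
    # Start with the whole line captured, then shrink by one character at a time.
    for end in range(len(line), -1, -1):
        attempts += 1
        m = REST.match(line, end)  # does the rest of the pattern fit here?
        if m:
            return line[:end], m.group("log_level"), m.group("thread"), attempts
    return None


timezone, log_level, thread, attempts = greedy_backtrack(LINE)
print(f"timezone={timezone!r} log_level={log_level!r} thread={thread!r}")
print(f"candidate captures tried: {attempts}")

# The regex engine reaches the same split on its own.
FULL = re.compile(r"^(?P<timezone>.*)" + REST.pattern)
full_match = FULL.match(LINE)
print(full_match.groupdict())
```

The capture stops before `INFO` because that is the longest capture that still leaves a space, an upper-case word, and a `[`...`]` pair for the remaining pattern to consume.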
As you can imagine, it is an expensive process. Depending on where the substring sits in the log line, it can make sense to use lazy matching instead of greedy (DATA instead of GREEDYDATA).
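To make the position argument concrete, here is a rough Python model, again using a hypothetical log fragment and simplified regex. It counts how many candidate captures a greedy `.*` (GREEDYDATA) and a lazy `.*?` (DATA) would try before the rest of the pattern fits. This is a simplification: a real regex engine does its bookkeeping differently, so treat the counts as relative, not as real step counts.

```python
import re

# Hypothetical log fragment (assumption, not the original log line).
LINE = "Coordinated Universal Time INFO [main] Service started"

# What must follow the captured field: level, then a bracketed thread name.
REST = re.compile(r" (?P<log_level>[A-Z]+) \[(?P<thread>[^\]]+)\]")


def count_attempts(line, lazy):
    """Return (captured field, number of candidate captures tried)."""
    if lazy:
        # Lazy: start with an empty capture and grow it.
        positions = range(len(line) + 1)
    else:
        # Greedy: start with the whole line and shrink it.
        positions = range(len(line), -1, -1)
    for attempts, end in enumerate(positions, start=1):
        if REST.match(line, end):
            return line[:end], attempts
    return None, len(line) + 1


greedy_field, greedy_tries = count_attempts(LINE, lazy=False)
lazy_field, lazy_tries = count_attempts(LINE, lazy=True)

# Same line with a much longer trailing message.
LONG_LINE = LINE + " " + "x" * 100
_, greedy_tries_long = count_attempts(LONG_LINE, lazy=False)
_, lazy_tries_long = count_attempts(LONG_LINE, lazy=True)

print(f"short line: greedy={greedy_tries} lazy={lazy_tries}")
print(f"long line:  greedy={greedy_tries_long} lazy={lazy_tries_long}")
```

Both strategies capture the same field here. In this model the lazy count depends only on how far the field ends from the start of the line, while the greedy count grows with everything that follows it, which is why a field near the start of a long line tends to favour DATA.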