Grok for string not followed by a second string combination

Hi All,

Apologies if this has been asked before but I couldn't find a similar question in the forum.

I have the following log file line entry:

Tue Apr 05 16:31:39 2016 1 10.102.180.12 37088 d:\directory\Excel filename with a space.xsls a s i r Rest of the log line.

Where the two flags immediately after the filename ("a s" in this example) can be any of the following combinations of transfer type/security:
a s
a n
b s
b n

Essentially, the first character represents the 'transfer type' (ascii or binary) and the second the transfer security (secure or non-secure).

I've been trying to use GROK to obtain the filename/path prior to any of the combinations above, and then continue to parse the rest of the line. Unfortunately the filename can include spaces, and also potentially the "a" character, as shown in the example above.

I'm struggling to format a regex to use the combinations of transfer type/security to allow me to find the end of the filename section in the log line and select it.

Ideally I'd like something like this as the logstash expression:

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} %{PATH:path} %{TRANSFER_MODE} %{TRANSFER_SECURITY} %{TRANSFER_STATUS} %{ACCESS_MODE} %{GREEDYDATA:the_rest}

Using the following custom patterns:

TIMESTAMP (%{DAY}[\s/-]%{MONTH}[\s/-]%{MONTHDAY}[\s]%{TIME}[\s]%{YEAR})
TRANSFER_MODE ([a|b])
TRANSFER_SECURITY ([s|n])
TRANSFER_STATUS ([i|j|k|o|p|q])
ACCESS_MODE ([a|r])

But this gives me a GROK error (using http://grokdebug.herokuapp.com/) as I don't find a match for %{TRANSFER_MODE} and others.

If I use the following pattern, it illustrates the problem with the %{PATH} capture:

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} %{PATH:path} %{GREEDYDATA:the_rest}

This finds the following:

{
"[": [
[
"Tue Apr 05 16:31:39 2016"
]
],
"transfer_mins": [
[
"1"
]
],
"remote_host": [
[
"10.102.180.12"
]
],
"transferred_bytes": [
[
"37088"
]
],
"path": [
[
"d:\directory\Excel filename with a space.xsls a s i r Rest of the"
]
],
"the_rest": [
[
"log"
]
]
}

and you can see that the path capture is too greedy.

If any masters-of-grok out there could advise, that would be much appreciated :slight_smile:

Cheers,
Steve

Hello,

First of all, a minor issue: your custom grok patterns are somewhat malformed for what you are after. For example, this

[a|b]

means "match any single character in the list containing a, |, and b".

You can either provide a list of valid characters or an ORed capture group, like either of the below (the same goes for the rest of your defined patterns except TIMESTAMP):

TRANSFER_MODE [ab]
TRANSFER_MODE (a|b)
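The difference between a character class and an ORed group can be checked outside of grok. Below is a minimal sketch in Python's `re` module, which is assumed here to treat these simple constructs the same way as the regex engine behind grok; the variable names are purely illustrative.

```python
import re

# A character class containing 'a', '|' and 'b': inside [...] the pipe
# is a literal character, not an OR operator.
char_class_with_pipe = re.compile(r"[a|b]")

# Two equivalent ways to say "a or b".
char_class = re.compile(r"[ab]")
alternation = re.compile(r"(a|b)")

for candidate in ["a", "b", "|"]:
    print(
        repr(candidate),
        bool(char_class_with_pipe.fullmatch(candidate)),
        bool(char_class.fullmatch(candidate)),
        bool(alternation.fullmatch(candidate)),
    )
```

Only the first pattern accepts the literal pipe character, which is why the bracketed form with a pipe is looser than intended.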

Apart from that, you should always look for a common "anchor point" to limit your greedy captures and break down the message more easily.
One likely candidate for it would be the file extension (especially if it's always an excel file), so having the file extension as a literal string in your pattern creates a boundary for your greedy data path capture.

Something like the below should work (just change the literal file extension to suit your needs):

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} %{GREEDYDATA:path}\.xls(x)? %{TRANSFER_MODE} %{TRANSFER_SECURITY} %{TRANSFER_STATUS} %{ACCESS_MODE} %{GREEDYDATA:the_rest}

Hi Paz,

Thanks for the really speedy reply, and for pointing out the error with the ORed capture groups - I'll fix that.

I take your point about finding 'anchors' within the log format, and had considered using the document suffix, until I talked to the technical team whose application produces this logfile. Basically, the application is a generic Managed File Transfer app, so the file (and also the extension) could be anything at all, as it does hundreds (or thousands) of different transfers every day.

Given that I can't rely on the extension, that's why I was wondering whether some sort of 'look-ahead' regex to find the next elements and stop the filename capture when I found them was the best approach.

As you might be able to tell, I'm not the best guy around with regex :slight_smile: so apologies for the lack of knowledge or the wrong terminology.

Best regards,
Steve

Hmmm, you may try using the dot character as an anchor (and just grab the file extension but drop it afterwards) like

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} %{GREEDYDATA:path}[.]%{WORD:file_ext} %{TRANSFER_MODE} %{TRANSFER_SECURITY} %{TRANSFER_STATUS} %{ACCESS_MODE} %{GREEDYDATA:the_rest}

but it would break if the filename itself contained dots. Basically in order to construct a safe greedy regex you'd need to know all the weird corner cases or possible values beforehand.

If you do want to try lookaheads, maybe this will suit you

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( (a|b)) )) %{TRANSFER_MODE} %{TRANSFER_SECURITY} %{TRANSFER_STATUS} %{ACCESS_MODE} %{GREEDYDATA:the_rest}

which looks for a space + either a or b character + space sequence as a pattern boundary. (assuming a and b are the possible TRANSFER_MODE flag values).

Hi Paz,

Thanks for the assistance. You're right - the filenames may well have dot characters in them so I'd ruled out using it as the marker for the end of the filename (a pity as that would have been a nice straightforward approach).

I think your last suggestion is very close to what I need:

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( (a|b)) )) %{TRANSFER_MODE} %{TRANSFER_SECURITY} %{TRANSFER_STATUS} %{ACCESS_MODE} %{GREEDYDATA:the_rest}

which looks for a space + either a or b character + space sequence as a pattern boundary. (assuming a and b are the possible TRANSFER_MODE flag values).

However, what I need is to look for:

space + either a or b + space + either s or n + space

as the pattern boundary.

This is because using "space + either a or b + space" fails when the filename contains " a " or " b ".
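The effect of widening the boundary can be sketched in Python's `re` module, assuming its lookahead behaves like grok's for this case. The input below is a hypothetical variant of the sample line whose trailing text also contains " a "; the leading fields are left out to keep the sketch short, and `(?P<name>...)` is Python's spelling of a named group.

```python
import re

# Hypothetical variant of the sample line: the trailing text contains " a ".
line = r"d:\directory\Excel filename with a space.xsls a s i r Rest of a log"

# Boundary: space + (a or b) + space.
one_flag = re.compile(r"(?P<path>.+(?= (?:a|b) )) (?P<the_rest>.*)")

# Boundary: space + (a or b) + space + (s or n) + space.
two_flags = re.compile(r"(?P<path>.+(?= (?:a|b) (?:s|n) )) (?P<the_rest>.*)")

print("one flag :", one_flag.match(line).group("path"))
print("two flags:", two_flags.match(line).group("path"))
```

With the single-flag boundary the greedy capture runs on to the last " a " in the line, so the path swallows the flags and part of the trailing text. The two-flag boundary stops at the end of the filename here, although it would still be fooled by a line that contains a sequence such as " a s " later on.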

Finally, would there be a way to use my custom patterns in your regex example. For instance, rather than:

(?<path>.+(?=( (a|b)) ))

is there a syntax to have:

(?<path>.+(?=( %{TRANSFER_MODE}) ))

or something similar?

Best regards & thanks again for the help - much appreciated.
Steve

OK - so nearly got it but now I'm seeing some odd behaviour after managing to capture the filename.

Given the log line:

Tue Apr 05 16:31:39 2016 1 10.102.180.12 37088 d:\directory\Excel filename with a space.xsls a s i r Rest of the log

and the following GROK pattern (thanks Paz):

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( (a|b) (s|n) ))) %{GREEDYDATA:the_rest}

I see the following output, which includes the filename captured correctly:

{
"[": [
[
"Tue Apr 05 16:31:39 2016"
]
],
"transfer_mins": [
[
"1"
]
],
"remote_host": [
[
"10.102.180.12"
]
],
"transferred_bytes": [
[
"37088"
]
],
"path": [
[
"d:\directory\Excel filename with a space.xsls"
]
],
"the_rest": [
[
"a s i r Rest of the log"
]
]
}

Which is great. But now when I try to go on to extend the GROK to capture the next group after the filename (the "a" in this case to indicate it's an ascii file transfer):

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( (a|b) (s|n) ))) %{TRANSFER_MODE:transfer_mode} %{GREEDYDATA:the_rest}

The GROK debugger shows "No Matches".

My custom patterns are set as follows:

TIMESTAMP (%{DAY}[\s/-]%{MONTH}[\s/-]%{MONTHDAY}[\s]%{TIME}[\s]%{YEAR})
TRANSFER_MODE (a|b)
TRANSFER_SECURITY (s|n)
TRANSFER_STATUS (i|j|k|o|p|q)
ACCESS_MODE (a|r)

If I don't use the custom pattern but instead just define 'transfer mode' as %{WORD} then it works:

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( (a|b) (s|n) ))) %{WORD:transfer_mode} %{GREEDYDATA:the_rest}

but this would then accept any word as the transfer mode, and I want to accept only "a" or "b".

Anybody got any idea what (stupid) mistake I've made now .... :slight_smile:

Cheers,
Steve

Actually, the issue was that I slightly misled you earlier about the custom grok patterns: the ORed patterns don't need to be bracketed, so defining them without the brackets should work.

Btw, you can indeed use custom grok patterns as lookahead boundaries. The following

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( %{TRANSFER_MODE} %{TRANSFER_SECURITY} ))) %{TRANSFER_MODE:transfer_mode} %{TRANSFER_SECURITY:transfer_security} %{GREEDYDATA:the_rest}

Gives this output on grokdebug

{
  "[": [
    "Tue Apr 05 16:31:39 2016"
  ],
  "transfer_mins": [
    "1"
  ],
  "remote_host": [
    "10.102.180.12"
  ],
  "transferred_bytes": [
    "37088"
  ],
  "path": [
    "d:\\directory\\Excel filename with a space.xlsx"
  ],
  "transfer_mode": [
    "a"
  ],
  "transfer_security": [
    "s"
  ],
  "the_rest": [
    "i r "
  ]
}
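For anyone who wants to try the same lookahead idea outside the grok debugger, here is a rough Python equivalent. The regex fragments standing in for TIMESTAMP, NUMBER and IP are simplified assumptions, not the definitions that ship with Logstash, and the sample is the "b s" line from this thread.

```python
import re

# Simplified stand-ins for the custom grok patterns (assumptions).
TRANSFER_MODE = r"(?:a|b)"
TRANSFER_SECURITY = r"(?:s|n)"

LINE_RE = re.compile(
    r"(?P<event_time>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"(?P<transfer_mins>\d+) "
    r"(?P<remote_host>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<transferred_bytes>\d+) "
    # Greedy path, cut off where " <mode> <security> " follows.
    rf"(?P<path>.+(?= {TRANSFER_MODE} {TRANSFER_SECURITY} )) "
    rf"(?P<transfer_mode>{TRANSFER_MODE}) "
    rf"(?P<transfer_security>{TRANSFER_SECURITY}) "
    r"(?P<the_rest>.*)"
)

sample = (
    r"Tue Apr 05 16:31:39 2016 1 10.102.180.12 37088 "
    r"d:\directory\Excel filename with a space.xsls b s i r Rest of the log"
)

fields = LINE_RE.match(sample).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")
```

The lookahead does not consume anything, so the mode and security flags are still available to be captured by the named groups that follow the path.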

Hi Paz,

Thanks for the update. On my side I don't seem to be able to get the ORed custom patterns to work correctly. Your latest GROK pattern:

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( %{TRANSFER_MODE} %{TRANSFER_SECURITY} ))) %{TRANSFER_MODE:transfer_mode} %{TRANSFER_SECURITY:transfer_security} %{GREEDYDATA:the_rest}

and the custom patterns (without the brackets) works fine with the sample log line I provided, where TRANSFER_MODE is set to 'a' and TRANSFER_SECURITY is set to 's'.

However, if I change the 'a' to a 'b' or the 's' to an 'n' then the GROK pattern fails to match. This seems to show the ORed pattern matching is not working as expected.

See this sample log line, which fails to match:

Tue Apr 05 16:31:39 2016 1 10.102.180.12 37088 d:\directory\Excel filename with a space.xsls b s i r Rest of the log

Interestingly, this line is matched when the regex defined under the custom pattern is embedded directly into the GROK pattern instead, i.e. the GROK pattern is:

%{TIMESTAMP:[@metadata][event_time]} %{NUMBER:transfer_mins} %{IP:remote_host} %{NUMBER:transferred_bytes} (?<path>.+(?=( (a|b) %{TRANSFER_SECURITY} ))) %{GREEDYDATA:the_rest}

Any clue as to why the ORed conditions don't seem to work when supplied as custom patterns?

Best regards,
Steve

Ah - ignore my last post. I discovered I'd got a bunch of whitespace at the end of my ORed patterns, which was tripping them up :frowning:
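One plausible way trailing whitespace produces exactly this symptom, sketched in Python under the assumption that a single stray space ended up inside each unbracketed alternation (so "a|b " and "s|n " rather than "a|b" and "s|n"). The second alternative then silently demands an extra space.

```python
import re

# The intended boundary: space, mode, space, security, space.
clean = re.compile(r" (?:a|b) (?:s|n) ")

# The same alternations with one stray trailing space each (assumed).
padded = re.compile(r" (?:a|b ) (?:s|n ) ")

for flags in [" a s ", " b s ", " a n "]:
    print(repr(flags), bool(clean.search(flags)), bool(padded.search(flags)))
```

The padded version still matches " a s " but rejects " b s " and " a n ", which lines up with a pattern that works for 'a' and 's' and fails for 'b' and 'n'.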

Paz - thanks for all the assistance - this pattern is working perfectly for me now...!!

Cheers,
Steve
