Help with Grok & Regexp: effective reuse of patterns

Bruno_Lavoie · May 27, 2016, 4:11pm

Hello,

I'm trying to extract fields from messages while reusing patterns for the recurring tokens, but sometimes my regexps are too greedy and capture past the closing brace.

I have a few different types of messages. Some tokens recur across message types, others do not, and their order varies.

Tokens are formatted as: PropertyName={property value}

Here are two example messages:

INFO com.l7tech.server.policy.assertion.ServerAuditDetailAssertion: -4: Message={Ouverture session}, MessageType={TypeMessageAuthentification}, Username={etud1}, SessionAuthorization={5af2143d-7368-46de-955e-2a014bd7d39f}, SessionClient={ZxFTldw1Ed242t1L7dML8cIgkM}, RequestId={00000154e84e6393-11147}, ServiceName={auth/oauth/v2/authorize}, ServiceId={7aea4881665af7743edf0dcb0d8ddfef}, ServiceGuid={304e9de3-ba27-4260-b448-e3476530a0c2}, ServiceVersion={54}, ClusterNodeName={Gateway1}
INFO com.l7tech.server.policy.assertion.ServerAuditDetailAssertion: -4: Message={Fermeture session}, MessageType={TypeMessageAuthentification}, SessionClient={p051Q3ZdhoCaNo9ASMp11uEhXHU}, RequestId={00000154e84e6393-1100b}, ServiceName={UL Page Logout}, ServiceId={7aea4881665af7743edf0dcb0d924d36}, ServiceGuid={b931e09d-9eaf-4e43-a14a-120815141b5d}, ServiceVersion={59}, ClusterNodeName={Gateway1}

To keep things readable and fully reusable, I have made a pattern for each recurring token:

################## 
# General patterns: the identical leading part of every message

UL_GSA_BASE             %{LOGLEVEL}%{SPACE}%{JAVACLASS}: %{INT}:
UL_GSA_COMMON           %{LOGLEVEL}%{SPACE}%{JAVACLASS}: %{INT}: %{UL_GSA_MESSAGE}, %{UL_GSA_MESSAGE_TYPE}
UL_GSA_COMMON_REMAINING %{UL_GSA_COMMON}, %{GREEDYDATA:remaining}

##################
# Reusable patterns 

UL_GSA_MESSAGE         Message={(?<message>.*?)}
UL_GSA_MESSAGE_TYPE    MessageType={(?<message_type>.*?)}

UL_GSA_USERNAME         Username={(?<username>.*?)}

UL_GSA_SESSION_CLIENT        SessionClient={(?<session_client>.*?)}
UL_GSA_SESSION_AUTHORIZATION SessionAuthorization={(?<session_authorization>.*?)}

UL_GSA_SSO              SSO={(?<sso>.*?)}
UL_GSA_REQUEST_ID       RequestId={(?<request_id>.*?)}
UL_GSA_API_KEY          APIKey={(?<api_key>.*?)}
UL_GSA_IP_ADDRESSES     IpAddresses={(?<ip_addresses>.*?)}

UL_GSA_RESPONSE_CODE    ResponseCode={(?<response_code>.*?)}
UL_GSA_RESPONSE_LATENCY ResponseLatency={(?<response_latency>.*?)}
UL_GSA_RESPONSE_ERROR   ResponseError={(?<response_error>.*?)}

UL_GSA_SERVICE_ID       ServiceId={(?<service_id>.*?)}
UL_GSA_SERVICE_NAME     ServiceName={(?<service_name>.*?)}
UL_GSA_SERVICE_GUID     ServiceGuid={(?<service_guid>.*?)}
UL_GSA_SERVICE_VERSION  ServiceVersion={(?<service_version>.*?)}

UL_GSA_GATEWAY_HOST        GatewayHost={(?<gateway_host>.*?)}
UL_GSA_GATEWAY_SERVICE_URL GatewayServiceUrl={(?<gateway_service_url>.*?)}
UL_GSA_CLUSTER_NODE_NAME   ClusterNodeName={(?<cluster_node_name>.*?)}

Then I construct the final patterns by reusing the needed parts in the right order:

################## 
# Specific patterns

UL_GSA_OUVERTURE_SESSION %{UL_GSA_COMMON}, %{UL_GSA_USERNAME}, %{UL_GSA_SESSION_AUTHORIZATION}, %{UL_GSA_SESSION_CLIENT}, %{UL_GSA_REQUEST_ID}, %{UL_GSA_SERVICE_NAME}, %{UL_GSA_SERVICE_ID}, %{UL_GSA_SERVICE_GUID}, %{UL_GSA_SERVICE_VERSION}, %{UL_GSA_CLUSTER_NODE_NAME}
UL_GSA_FERMETURE_SESSION %{UL_GSA_COMMON}, %{UL_GSA_SESSION_CLIENT}, %{UL_GSA_REQUEST_ID}, %{UL_GSA_SERVICE_NAME}, %{UL_GSA_SERVICE_ID}, %{UL_GSA_SERVICE_GUID}, %{UL_GSA_SERVICE_VERSION}, %{UL_GSA_CLUSTER_NODE_NAME}

On the http://grokdebug.herokuapp.com/ site, testing each message type individually against its corresponding pattern works fine.

But when using

INFO com.l7tech.server.policy.assertion.ServerAuditDetailAssertion: -4: Message={Ouverture session}, MessageType={TypeMessageAuthentification}, Username={etud1}, SessionAuthorization={5af2143d-7368-46de-955e-2a014bd7d39f}, SessionClient={ZxFTldw1Ed242t1L7dML8cIgkM}, RequestId={00000154e84e6393-11147}, ServiceName={auth/oauth/v2/authorize}, ServiceId={7aea4881665af7743edf0dcb0d8ddfef}, ServiceGuid={304e9de3-ba27-4260-b448-e3476530a0c2}, ServiceVersion={54}, ClusterNodeName={Gateway1}

with the non-corresponding pattern %{UL_GSA_FERMETURE_SESSION}, my UL_GSA_MESSAGE_TYPE pattern seems to be too greedy and captures too much, whereas I would like the match to fail.

The captured message_type field then looks like this:

{
  "message": [
    [
      "Ouverture session"
    ]
  ],
  "message_type": [
    [
      "TypeMessageAuthentification}, Username={etud1}, SessionAuthorization={5af2143d-7368-46de-955e-2a014bd7d39f"
    ]
  ],
...

So, I tried a lot of things...

How can I make my capture stay strictly within one pair of curly braces?

Thanks a lot

magnusbaeck · May 29, 2016, 10:50am

Try this:

UL_GSA_MESSAGE         Message={(?<message>[^}]+)}
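The same change would presumably apply to every Name={...} pattern, not only UL_GSA_MESSAGE. A lazy .*? still grows past a closing brace whenever that is the only way for the rest of the pattern to match, while [^}]+ cannot cross a brace at all. Here is a minimal Python sketch of the difference, assuming a shortened copy of the "Ouverture session" message and a two-token stand-in for UL_GSA_FERMETURE_SESSION (Python spells named groups (?P<name>...) instead of grok's (?<name>...)):

```python
import re

# Shortened copy of the "Ouverture session" message (values trimmed for brevity).
LINE = (
    "Message={Ouverture session}, MessageType={TypeMessageAuthentification}, "
    "Username={etud1}, SessionAuthorization={5af2143d}, "
    "SessionClient={ZxFTldw1Ed242t1L7dML8cIgkM}"
)

# Stand-in for the wrong pattern: MessageType directly followed by SessionClient.
LAZY = re.compile(
    r"MessageType=\{(?P<message_type>.*?)\}, "
    r"SessionClient=\{(?P<session_client>.*?)\}"
)
STRICT = re.compile(
    r"MessageType=\{(?P<message_type>[^}]+)\}, "
    r"SessionClient=\{(?P<session_client>[^}]+)\}"
)

# The lazy group expands until "}, SessionClient={" finally follows it.
lazy_match = LAZY.search(LINE)
print(lazy_match.group("message_type"))

# The negated class stops at the first "}", so the overall match fails.
strict_match = STRICT.search(LINE)
print(strict_match)
```

Note that [^}]+ requires at least one character; if a token can be empty (Name={}), [^}]* would be the variant to use.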

But primarily I'd look into the kv filter. Its specific purpose is to parse data like this.
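
The kv filter's actual options are not shown in this thread, so the following is only a Python illustration of the key-value idea, not Logstash configuration. It assumes every token has the form Name={value} with no closing brace inside the value; the helper name parse_tokens is hypothetical.

```python
import re

# One Name={value} token; the value may be empty but cannot contain "}".
TOKEN = re.compile(r"(?P<key>\w+)=\{(?P<value>[^}]*)\}")

def parse_tokens(line: str) -> dict:
    """Return every Name={value} pair found in line, whatever the order."""
    return {m.group("key"): m.group("value") for m in TOKEN.finditer(line)}

# Shortened copy of the "Fermeture session" message.
line = (
    "INFO com.l7tech.server.policy.assertion.ServerAuditDetailAssertion: -4: "
    "Message={Fermeture session}, MessageType={TypeMessageAuthentification}, "
    "SessionClient={p051Q3ZdhoCaNo9ASMp11uEhXHU}, ServiceName={UL Page Logout}, "
    "ServiceVersion={59}, ClusterNodeName={Gateway1}"
)

fields = parse_tokens(line)
print(fields["Message"])        # a token that is present
print(fields.get("Username"))   # a token this message type does not carry
```

Because tokens are collected by name rather than by position, a new or reordered token needs no per-message-type pattern.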

Bruno_Lavoie · May 30, 2016, 5:07pm

Shame on me!
I never thought of the kv filter.
It makes my setup more resilient to changed and new messages.
I'm keeping my patterns with your regex suggestion, just in case.

Thanks