Grok URI extract

I've been searching and reading the documentation, but I can't seem to find how to extract parameters from a URI. The best I could do so far is:

  grok {
     match => [ "url", "%{URIPROTO:uri_proto}://(?:%{USER:user}(?::[^@]*)?@)?(?:%{URIHOST:uri_domain})?(?:%{URIPATHPARAM:uri_param})?" ]
  }

Result

"url" => "http://hostname.domain.tld/_astats?application=&inf.name=eth0"
"uri_proto" => "http",
"uri_domain" => "hostname.domain.tld",
"uri_param" => "/_astats?application=&inf.name=eth0" 

I'd like to split the URL into separate fields such as "uri_domain, uri_path, uri_root". I would also like to extract the query string into sub-fields like "query_params: application, inf.name".

For example, how do I expand URIPATH and URIPARAM instead of URIPATHPARAM? Why aren't all of them expanded when I call just "URI" from these grok definitions?

paths

PATH (?:%{UNIXPATH}|%{WINPATH})
UNIXPATH (/([\w_%!$@:.,~-]+|\\.)*)+
TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?
URIHOST %{IPORHOST}(?::%{POSINT:port})?
# uripath comes loosely from RFC1738, but mostly from what Firefox
# doesn't turn into %XX
URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+
#URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*
URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?

Thanks for any information.

If you run this grok pattern on your uri_param field, it will break it up into two separate components: everything before the question mark and the query string after it. No idea if it is the best method, but it has worked OK for us.

GROK
%{GREEDYDATA:uri_stem}\?%{GREEDYDATA:uri_query}

If you want the query portion broken down into separate components you can try using the KV filter.

  kv {
    source => "uri_query"
    field_split => "&"
    prefix => "query_"
  }

So now we can take the uri_query we generated above (with added parameters).
{"uri_query":"application=&inf.name=eth0&test1=blah&test2=blahblahblah"}
And when we run it through that KV filter we get:

{
       "uri_query" => "application=&inf.name=eth0&test1=blah&test2=blahblahblah",
       "query_inf.name" => "eth0",
       "query_test1" => "blah",
       "query_test2" => "blahblahblah"
}
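
Note that "application" does not show up in that output: its value is empty, and kv skips empty values by default. If you need the key to exist anyway, newer releases of the kv filter have an allow_empty_values option. A minimal sketch, assuming your plugin version supports it (check the kv docs for your release first):

```
kv {
  source => "uri_query"
  field_split => "&"
  prefix => "query_"
  # Assumption: allow_empty_values exists in your kv plugin version.
  # With it enabled, "application=" should produce an empty
  # query_application field instead of being dropped.
  allow_empty_values => true
}
```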

You can then drop the original uri_query field or keep it around. If you want to limit which fields get created, use the include_keys option so that you don't accidentally create hundreds of new fields in your cluster.
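
A sketch of what limiting the keys could look like. The two key names are just the ones from the example above, and as far as I know include_keys is compared against the parsed key before the prefix is added, so verify that on your version:

```
kv {
  source => "uri_query"
  field_split => "&"
  prefix => "query_"
  # Only these parsed keys become fields; any other key in the
  # query string is ignored.
  include_keys => [ "inf.name", "test1" ]
}
```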

Thanks, this was really helpful...

I found the kv "target" option, which puts the parsed query string into a container field. Exactly what I needed. I couldn't put all my "match" statements in the same grok, though; they had to be separate filters...

I'm still baffled that I can't just call "URI" to extract everything based on the logstash-grok pattern here: https://github.com/hpcugent/logstash-patterns/blob/master/files/grok-patterns

filter {
  grok {
    match => [ "url", "%{URIPROTO:uri_proto}://(?:%{USER:user}(?::[^@]*)?@)?(?:%{URIHOST:uri_domain})?(?:%{URIPATHPARAM:uri_param})?" ]
  }
  grok {
    match => [ "uri_param", "%{GREEDYDATA:uri_path}\?%{GREEDYDATA:uri_query}" ]
  }

  kv {
    source => "uri_query"
    field_split => "&"
    target => "query"
  }
}

The result was

  "url" => "http://cdn1cdedge0001.coxlab.net/_astats?application=&inf.name=eth0",
 "uri_proto" => "http",
 "uri_domain" => "cdn1cdedge0001.coxlab.net",
 "uri_param" => "/_astats?application=&inf.name=eth0",
 "uri_path" => "/_astats",
 "uri_query" => "application=&inf.name=eth0",
 "query" => {
    "inf.name" => "eth0"
 }

Then I can simply remove "uri_query" since it's now a duplicate. I haven't tested this in ES yet, though.
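
Dropping the duplicate could look roughly like this, placed after the kv filter inside the same filter block. A sketch only; the conditional is my own addition so the raw string is kept whenever kv produced nothing:

```
# Only drop the raw query string once kv has created the
# "query" container field.
if [query] {
  mutate {
    remove_field => [ "uri_query" ]
  }
}
```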

I think for whatever reason the URI Grok statement just wasn't designed how you or I want it to work.
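
The reason, as far as I can tell from the pattern file quoted earlier: grok only stores a capture when the reference carries a field name after the colon, as in %{SYNTAX:name}. The URI definition references URIPROTO, USER, URIHOST and URIPATHPARAM without names, so %{URI:uri} gives you one field holding the whole match (apart from port, which URIHOST names itself). To get the pieces in one pass you have to spell the pattern out with your own names. A sketch, untested against your data, where uri_path and uri_query are simply the field names used in this thread:

```
grok {
  # Same structure as the URI pattern, but URIPATHPARAM is split into a
  # named URIPATH and an optional query string. Because the query part
  # is optional, URLs without a "?" should still get uri_path.
  match => [ "url", "%{URIPROTO:uri_proto}://(?:%{USER:user}(?::[^@]*)?@)?(?:%{URIHOST:uri_domain})?(?:%{URIPATH:uri_path})?(?:\?%{NOTSPACE:uri_query})?" ]
}
```

The literal \? sits outside the capture, so uri_query comes out without the leading question mark and can go straight into kv.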

A comment on the second grok statement.

As long as there are query parameters (anything after the question mark), this will work fine.
However, if a URI has no query string, this grok statement will not match, which means that neither uri_query nor uri_path will be populated. A missing uri_query is expected since there is no query, but depending on your usage you may still want uri_path to show up. In that case it would hold the same value as uri_param, but for aggregations or searching that may not matter.
If you want uri_path populated regardless, you can try this version. If the first pattern doesn't match, grok moves on to the second one. You could also use a conditional to check whether the field exists and add it if not, but this method seems pretty simple.

  grok {
    break_on_match => true
    match => [
      "uri_param", "%{GREEDYDATA:uri_path}\?%{GREEDYDATA:uri_query}",
      "uri_param", "%{GREEDYDATA:uri_path}"
    ]
  }
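
For comparison, the conditional alternative could look roughly like this, placed after the original single-pattern grok. A sketch, assuming the field names from this thread:

```
# If the "?"-based grok did not match, uri_path is missing;
# fall back to copying the whole uri_param into it.
if [uri_param] and ![uri_path] {
  mutate {
    add_field => { "uri_path" => "%{uri_param}" }
  }
}
```

Keep in mind that with this approach the failed grok still tags the event with _grokparsefailure unless you change tag_on_failure, which is one more reason the two-pattern version is simpler.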

Hey, how do I split uri_stem?

Which uri_stem? We mentioned a few here.
And how do you want it split up?

Hey, thanks for replying. I was able to figure that out later.
Here is the link to my post about the issue, in case anyone is still looking at a similar problem and its solution.