How to regex URLs with Grok

Hello there,

Sorry to bother you, but I'm slowly working my way toward fully using and understanding the stack, and right now I'm blocked by a (I guess) simple problem with Logstash.

I parse multiple events through Logstash and everything is going fine: I do a few operations on them and so forth. But right now I'm a bit lost when I want to run a regex on the events I parse.

I receive JSON events. Each event has a field 'url' holding a complete URL, which contains some information I would like to extract before going to the output.
The entries received look like these:

{
	"id": "AVERYRANDOMID",
	"timeRef": "2022-10-25T10:45:05.000+02:00",
	"url": "https://www.mywebsite.com/the-name-of-my-page-12345678",
	"queryParams": {
		"utm_medium": "email",
		"utm_source": "newsletter",
		"utm_campaign": "MyCampaign"
	}
}
{
	"id": "ANOTHERRANDOMID",
	"timeRef": "2022-10-25T10:45:05.000+02:00",
	"url": "https://www.mywebsite.com/",
	"queryParams": {
	}
}
{
	"id": "INEEDANOTHERONE",
	"timeRef": "2022-10-25T10:45:05.000+02:00",
	"url": "https://www.mywebsite.com/4567890",
	"queryParams": {
	}
}
{
	"id": "LASTONEIPROMISE",
	"timeRef": "2022-10-25T10:45:05.000+02:00",
	"url": "https://www.mywebsite.com/the-name-of-my-page-12345678&fbclid=IwAR2Kf_HojrEdNy",
	"queryParams": {
		"utm_medium": "email",
		"utm_source": "newsletter",
		"utm_campaign": "MyOtherCampaign"
	}
}

In the end, for those 4 entries, I need to retrieve the ID of the article (if present) from the URL. Here are all the rules I found about that ID:

  • Always 6 to 8 digits
  • Can follow the title of the article/page
  • Can be directly at the root of the website (without the title)
  • Can be (ok, will be) followed by extra characters (shared from a social network, from a utm campaign, etc.)
  • Can also be absent from the URL (because people sometimes (ahah) use the tools incorrectly)

I have the feeling it can be done in Logstash, which would let me avoid some processing after the output, but right now I'm kind of stuck. And I'm trying to avoid Ruby completely on this one, if that's possible.

Did I miss something? If someone could point me in the right direction, it would be much appreciated. :slight_smile:

You might be able to use something as simple as

grok { match => { "url" => "(?<articleId>\d{6,8})" } }
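As a quick sanity check outside Logstash, the same expression can be tried against the four sample URLs from the question. This is only a sketch in Python, not part of the pipeline: grok uses the Oniguruma regex engine, so the named group is spelled `(?P<articleId>...)` here instead of `(?<articleId>...)`, and the helper name `extract_article_id` is made up for illustration.

```python
import re

# Same expression as in the grok filter above.
# Python spells named groups (?P<name>...) rather than (?<name>...).
ARTICLE_ID = re.compile(r"(?P<articleId>\d{6,8})")


def extract_article_id(url):
    """Return the first run of 6 to 8 digits in the URL, or None if there is none."""
    match = ARTICLE_ID.search(url)
    return match.group("articleId") if match else None


# The four "url" values from the sample events in the question.
SAMPLE_URLS = [
    "https://www.mywebsite.com/the-name-of-my-page-12345678",
    "https://www.mywebsite.com/",
    "https://www.mywebsite.com/4567890",
    "https://www.mywebsite.com/the-name-of-my-page-12345678&fbclid=IwAR2Kf_HojrEdNy",
]

for url in SAMPLE_URLS:
    print(url, "->", extract_article_id(url))
```

Because the pattern is unanchored, a longer run of digits (say 9) would still yield its first 8, and a URL without an ID simply does not match. In grok, an event that does not match gets the `_grokparsefailure` tag by default.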

If you need it to be at the end of the URI path then maybe

grok { match => { "url" => "%{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATH:[@metadata][uri]})?" } }    
grok { match => { "[@metadata][uri]" => "(?<articleId>\d{6,8})$" } }
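To see how the anchored variant behaves on the sample URLs, here is a rough Python sketch. It assumes `urllib.parse.urlsplit(...).path` is a fair stand-in for what the first grok stores in `[@metadata][uri]`, which is only an approximation of grok's `URIPATH` pattern; the helper name `extract_trailing_article_id` is hypothetical.

```python
import re
from urllib.parse import urlsplit

# Anchored variant: the 6 to 8 digits must be the last thing in the path.
TRAILING_ID = re.compile(r"(?P<articleId>\d{6,8})$")


def extract_trailing_article_id(url):
    """Return the digits at the very end of the URL path, or None."""
    # Assumption: urlsplit's path approximates the URIPATH capture.
    path = urlsplit(url).path
    match = TRAILING_ID.search(path)
    return match.group("articleId") if match else None


for url in (
    "https://www.mywebsite.com/the-name-of-my-page-12345678",
    "https://www.mywebsite.com/4567890",
    "https://www.mywebsite.com/the-name-of-my-page-12345678&fbclid=IwAR2Kf_HojrEdNy",
):
    print(url, "->", extract_trailing_article_id(url))
```

In this sketch the `&fbclid=...` suffix of the last sample stays in the path (there is no `?` to start a query string), so the anchored pattern finds no ID there. That suggests the unanchored version is the better fit when IDs are followed by extra characters.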

Hello @Badger!

Thanks a million for taking the time. Your first idea was the right one and works even better than I expected. I got everything I need with that grok filter, and a nice little add_field got all the data stored where I want it.

I had some difficulty understanding how Grok works and how it could help me, but thanks to your solution I feel like I saved 2 weeks of testing and swearing at my screen. Once again, thanks for the help. :slight_smile:

Edit: And I just found out the add_field is not even necessary. :smiley:
Slowly grokking.
