Parsing URL with Logstash (using ECS fields) nested!

Vincent_Maury · November 29, 2019, 10:12am

A follow-up to "Parsing URL with Logstash (using ECS fields)", with several improvements such as nested fields and TLD parsing.

# Parse a URL field named url (following the RSA meta field name)
filter {
	mutate {
		rename => { "message" => "[url][original]"}
	}
	grok {
		match => {
			"[url][original]" => [
				# match https://user:pwd@stuff.domain.com:8080/some/path?p1=v1&p2=v2#anchor (port and query are optional)
				"%{URIPROTO:[url][scheme]}://(?:%{USER:[url][username]}:(?<[url][password]>[^@]*)@)?(?:%{IPORHOST:[url][address]}(?::%{POSINT:[url][port]})?)?(?:%{URIPATH:[url][path]}(?:%{URIPARAM:[url][query]})?)?",
				# match stuff.domain.com:8080/some/path?p1=v1&p2=v2#anchor (query is optional)
				"%{IPORHOST:[url][address]}(?::%{POSINT:[url][port]})(?:%{URIPATH:[url][path]}(?:%{URIPARAM:[url][query]})?)?",
				# match /some/path?p1=v1&p2=v2#anchor (query is optional)
				"%{URIPATH:[url][path]}(?:%{URIPARAM:[url][query]})?"
			]
		}
		add_tag => [ "urlparsed" ]
	}
	if "urlparsed" in [tags] {
		# parse the address to distinguish domain or ip
		grok {
			match => {
				"[url][address]" => "(%{IP:[url][ip]}|%{HOSTNAME:[url][domain]})"
			}
		}
		# Requires the tld filter plugin to be installed, see https://www.elastic.co/guide/en/logstash/current/plugins-filters-tld.html
		tld {
			source => "[url][domain]"
			target => "[url][tld]"
		}
		mutate {
			rename => {
				"[url][tld][domain]" => "[url][registered_domain]"
				"[url][tld][tld]" => "[url][top_level_domain]"
				"[url][tld][sld]" => "[url][second_level_domain]"
				"[url][tld][trd]" => "[url][sub_domain]"
			}
			remove_field => [ "[url][tld]" ]
		}
		# parse the query to extract fragment
		grok {
			match => {
				"[url][query]" => "^\?(?<[url][query]>[A-Za-z0-9$.+!*'|(){},~@%&/=:;_?\-\[\]<>]*)(?:#(?:%{WORD:[url][fragment]}))?"
			}
			overwrite => [ "[url][query]" ]
		}
		kv {
			source => "[url][query]"
			field_split => "&"
			value_split => "="
			target => "[url][queryparams]"
		}
	}
}
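
To try the filter locally, the sketch below feeds a few sample URLs through it and prints the parsed events. It assumes the filter block above sits in the same pipeline file. The sample lines are made up for illustration, and the generator input and rubydebug codec are only used here as a convenient test harness.

```
input {
	generator {
		# hypothetical sample URLs, one per grok pattern
		lines => [
			"https://user:pwd@stuff.domain.com:8080/some/path?p1=v1&p2=v2#anchor",
			"stuff.domain.com:8080/some/path?p1=v1&p2=v2#anchor",
			"/some/path?p1=v1&p2=v2#anchor"
		]
		# emit each line once, then stop
		count => 1
	}
}
output {
	# pretty-print every event with all of its fields
	stdout { codec => rubydebug }
}
```

Events that match one of the patterns in the first grok carry the urlparsed tag. A grok filter that does not match tags the event with _grokparsefailure by default, which makes misses easy to spot in the output.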

webmat · November 29, 2019, 1:38pm

I haven't tested it, but it looks good.

A few comments:

ECS comment: make sure to parse out the extension (url.extension) when there is one.
Non-ECS comment: I once took down a cluster by parsing out all the query params of a busy web application. Make sure your custom field url.queryparams uses a datatype that won't cause a mapping explosion, such as flattened.
ECS Nitpick: The best match for IPORHOST in ECS is .address, which is specified as the address "when you're not sure yet if it's an IP, a domain or a unix socket". So for URLs, you'd fill .address, then copy out to .ip if it's an IP, and otherwise copy to .domain.
- This way you end up with .address which is filled reliably, 100% of the time.
- If you need to do IP-specific analysis, you have .ip which is the ip datatype, and lets you do CIDR lookups, for example. Or just looking for exists:url.ip may surface interesting weird stuff.
- If you need to do analysis on domain names, you have .domain and all of the domain breakdown fields which won't contain IP addresses.
This actually brings me to the domain breakdown fields. I don't recall whether we have a solid way to break them down by effective TLD in Logstash. But if the data source covers many domains (as opposed to incoming web traffic on your own webserver), it would be interesting to fill the domain breakdown fields as well. So "www.example.co.uk" becomes:
- .top_level_domain:co.uk => to analyze broad traffic destinations
- .registered_domain:example.co.uk => to group traffic by web properties (e.g. assets.example.co.uk, downloads.example.co.uk are all part of example.co.uk)
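
The extension comment could be handled with one more grok inside the urlparsed conditional. This is a minimal, untested sketch: ECS defines url.extension without the leading dot, and the character class used here for the extension is an assumption that may need widening.

```
grok {
	match => {
		# capture what follows the last dot of the path, e.g. "html" for /docs/index.html
		"[url][path]" => "\.(?<[url][extension]>[A-Za-z0-9]+)$"
	}
	# many paths have no extension, so do not tag those events as failures
	tag_on_failure => []
}
```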

Vincent_Maury · November 29, 2019, 7:23pm

Thank you very much @webmat.
I followed the .address advice, did the TLD parsing (with the custom plugin), and updated the configuration above accordingly.
I'm still unsure how to address your second point on query params. Once the kv filter is done, should I convert the result into JSON and then use an ES template to force the flattened type?

webmat · November 29, 2019, 8:13pm

After the kv filter is done, you should have multiple keys nested under [url][queryparams].

The only thing you would need to do, in order to avoid the mapping explosion, is modify your index template so that the field url.queryparams itself is of type flattened.

So assuming you're using the sample ECS template we provide here, you could add your custom field right below the definition for the query field, like this (add any other parameters you need for the flattened type):

          "query": {
            "ignore_above": 1024,
            "type": "keyword"
          },
          "queryparams": {
            "type": "flattened"
          },
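
One way to load such a template from Logstash is through the elasticsearch output. The sketch below rests on several assumptions: the host, index name, template name and file path are placeholders, and the JSON file is expected to contain the ECS template with the queryparams entry added.

```
output {
	elasticsearch {
		# placeholder host and index name
		hosts => [ "http://localhost:9200" ]
		index => "urls-%{+YYYY.MM.dd}"
		# let Logstash install the template before indexing
		manage_template => true
		# hypothetical path to the edited ECS template
		template => "/etc/logstash/templates/ecs-url.json"
		template_name => "ecs-url"
		# replace a template of the same name that is already installed
		template_overwrite => true
	}
}
```

A template only applies to indices created after it is installed, so an existing index keeps its old mapping for url.queryparams.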

Another option would indeed be to do as you describe: turn the resulting structure into one big string, where perhaps the text datatype could help you dig into it.
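
A rough sketch of that string-based alternative, assuming the json_encode filter plugin is available (it may need to be installed separately) and using a hypothetical target field name:

```
filter {
	# only act on events where kv actually produced something
	if [url][queryparams] {
		json_encode {
			source => "[url][queryparams]"
			# hypothetical field name for the serialized params
			target => "[url][queryparams_json]"
		}
		mutate {
			# drop the nested object so it cannot add new mapped fields
			remove_field => [ "[url][queryparams]" ]
		}
	}
}
```

The new field would then need its own text or keyword entry in the index template.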

But I would definitely give flattened a try first. It behaves somewhat like a bunch of keyword fields, but also avoids the mapping explosion.
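
The number of keys can also be bounded on the Logstash side, whatever the mapping. As a sketch, the kv filter's include_keys option keeps only the listed parameters. The key names below are placeholders taken from the sample URL.

```
kv {
	source => "[url][query]"
	field_split => "&"
	value_split => "="
	target => "[url][queryparams]"
	# hypothetical allow-list: every other query parameter is dropped
	include_keys => [ "p1", "p2" ]
}
```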

