Parsing URL with Logstash (using ECS fields) nested!

This follows up on "Parsing URL with Logstash (using ECS fields)",
with several improvements such as nested fields and TLD parsing.

# Parse a URL field named url (named after the RSA meta field)
filter {
	mutate {
		rename => { "message" => "[url][original]"}
	}
	grok {
		match => {
			"[url][original]" => [
				# match https://user:pwd@stuff.domain.com:8080/some/path?p1=v1&p2=v2#anchor
				# (credentials, port, path and query are each optional)
				"^%{URIPROTO:[url][scheme]}://(?:%{USER:[url][username]}:(?<[url][password]>[^@]*)@)?(?:%{IPORHOST:[url][address]}(?::%{POSINT:[url][port]})?)?(?:%{URIPATH:[url][path]}(?:%{URIPARAM:[url][query]})?)?",
				# match stuff.domain.com:8080/some/path?p1=v1&p2=v2#anchor
				"^%{IPORHOST:[url][address]}(?::%{POSINT:[url][port]})?(?:%{URIPATH:[url][path]}(?:%{URIPARAM:[url][query]})?)?",
				# match /some/path?p1=v1&p2=v2#anchor
				"^%{URIPATH:[url][path]}(?:%{URIPARAM:[url][query]})?"
			]
		}
		add_tag => [ "urlparsed" ]
	}
	if "urlparsed" in [tags] {
		# parse the address to distinguish domain or ip
		grok {
			match => {
				"[url][address]" => "(%{IP:[url][ip]}|%{HOSTNAME:[url][domain]})"
			}
		}
		# The tld filter is not bundled with Logstash and must be installed separately (logstash-filter-tld), see https://www.elastic.co/guide/en/logstash/current/plugins-filters-tld.html
		tld {
			source => "[url][domain]"
			target => "[url][tld]"
		}
		mutate {
			rename => {
				"[url][tld][domain]" => "[url][registered_domain]"
				"[url][tld][tld]" => "[url][top_level_domain]"
				"[url][tld][sld]" => "[url][second_level_domain]"
				"[url][tld][trd]" => "[url][sub_domain]"
			}
			remove_field => [ "[url][tld]" ]
		}
		# parse the query to extract fragment
		grok {
			match => {
				"[url][query]" => "^\?(?<[url][query]>[A-Za-z0-9$.+!*'|(){},~@%&/=:;_?\-\[\]<>]*)(?:#(?:%{WORD:[url][fragment]}))?"
			}
			overwrite => [ "[url][query]" ]
		}
		kv {
			source => "[url][query]"
			field_split => "&"
			value_split => "="
			target => "[url][queryparams]"
		}
	}
}
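
To try the filter on a sample URL, here is a minimal sketch. It assumes the filter block above sits in the same pipeline file; the generator input emits the example URL from the comments once, and the rubydebug codec prints the parsed event so the nested url fields can be checked by eye.

```
input {
	generator {
		# Sample URL taken from the comment in the grok patterns above
		lines => [ "https://user:pwd@stuff.domain.com:8080/some/path?p1=v1&p2=v2#anchor" ]
		# Emit the line once instead of looping forever
		count => 1
	}
}

# ... the filter { } block above goes here ...

output {
	# Print the whole event, including the nested [url] fields
	stdout { codec => rubydebug }
}
```

The generator input puts each line into the message field, which is what the rename at the top of the filter expects.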

Haven't tested it, but it looks good :+1:

A few comments:

  • ECS comment: make sure to parse out the extension, when there's one :slight_smile:
  • Non-ECS comment: I've taken down a cluster once, parsing out all query params of a busy web application :joy: Make sure your custom field url.queryparams uses a datatype that's not going to cause a mapping explosion, like flattened :slight_smile:
  • ECS Nitpick: The best match for IPORHOST in ECS is .address, which is specified as the address "when you're not sure yet if it's an IP, a domain or a unix socket". So for URLs, you'd fill .address, then copy out to .ip if it's an IP, and otherwise copy to .domain.
    • This way you end up with .address which is filled reliably, 100% of the time.
    • If you need to do IP-specific analysis, you have .ip which is the ip datatype, and lets you do CIDR lookups, for example. Or just looking for exists:url.ip may surface interesting weird stuff.
    • If you need to do analysis on domain names, you have .domain and all of the domain breakdown fields which won't contain IP addresses.
  • This actually brings me to the domain breakdown fields. I don't recall if we have a solid way to break them down by effective TLD in Logstash. But if the data source analyses many domains (as opposed to incoming web traffic on your webserver), it would be interesting to fill the domain breakdown fields as well. So "www.example.co.uk" becomes:
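
As a sketch of that breakdown, assuming the ECS field names url.registered_domain and url.top_level_domain and that the effective TLD of the example is co.uk, the expected values are listed in the comments below. The grok that follows is an untested sketch for the extension point in the first bullet; it assumes the path has already been extracted into [url][path] and fills url.extension, which ECS defines without the leading dot.

```
# Expected breakdown of "www.example.co.uk" (effective TLD "co.uk"):
#   url.domain            => "www.example.co.uk"
#   url.registered_domain => "example.co.uk"
#   url.top_level_domain  => "co.uk"

filter {
	if [url][path] {
		grok {
			# Capture what follows the last dot, only when it ends the path
			match => { "[url][path]" => "\.(?<[url][extension]>[A-Za-z0-9]+)$" }
			# Many paths have no extension, so do not tag those events as failures
			tag_on_failure => []
		}
	}
}
```

With this pattern, a path such as /index.html would yield html, while /some/path would leave the field unset.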

Thank you very much @webmat
I followed the .address advice, added the TLD parsing (with the custom plugin), and updated the configuration above accordingly.
I'm still unsure how to address your second point on query params. Once the kv filter is done, should I convert the result into JSON and then use an ES template to force the flattened type?

After the kv filter is done, you should have multiple keys nested under [url][queryparams].

The only thing you would need to do, in order to avoid the mapping explosion, is modify your index template so that the field url.queryparams itself is of type flattened.

So assuming you're using the sample ECS template we provide here, you could add your custom field right below the definition for the query field, like this:

          "query": {
            "ignore_above": 1024, 
            "type": "keyword"
          }, 
          "queryparams": {
            "type": "flattened",
            // other params for the flattened type?
          }, 

Another option would indeed be to do as you describe: turn the resulting structure into one big string, where the text datatype could perhaps help you search inside it.

But I would definitely give flattened a try first. It behaves somewhat like a bunch of keyword fields, but also avoids the mapping explosion.
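
If Logstash is the one installing the index template, one way to apply the edited template is through the elasticsearch output. This is a sketch only: the host, index name, file path and template name are all hypothetical, and it assumes an Elasticsearch version that supports the flattened type (7.3 or later) and a template whose index_patterns match the index name used here.

```
output {
	elasticsearch {
		# Assumption: a local single-node cluster
		hosts => [ "http://localhost:9200" ]
		# Hypothetical index name; it must match the template's index_patterns
		index => "urls-%{+YYYY.MM.dd}"
		# Let Logstash install the template at startup
		manage_template => true
		# Hypothetical path to the ECS template edited to add url.queryparams as flattened
		template => "/etc/logstash/ecs-template.json"
		# Hypothetical template name
		template_name => "ecs-urls"
		# Replace a previously installed template of the same name
		template_overwrite => true
	}
}
```

A template only applies to indices created after it is installed, so an existing index keeps its old mapping until it rolls over or is reindexed.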