How to correctly parse and then enrich log data in Logstash?

I am working on my thesis and I have this setup: Webserver->Filebeat->Logstash->Kafka->ELK Server

I have logs that contain the product info in the GET request from the webserver, depending on what was being served. An example would be:
GET /product-category/mens-clothing/jacket/ (then if you clicked on the black leather jacket you would get the next line)
GET /product/black-leather-jacket/

I want to be able to extract the category (men's clothing), type (jacket), and product (black leather jacket) from the log data so I can do some ML and look at what particular customers are doing. I have a regex to match each variation, but I don't know how to handle both.

Hence my question, how do I:

  1. Match both
  2. In the case of only having the URL with the product, how would I then enrich this data from some external source like a database or file? E.g. when I get the leather jacket URL, how would I poll the database/file to look up what category and type are associated with the leather jacket (jacket and men's clothing)?

Do you have a pattern for the log already?
If so, you should be able to split the URL field into parts.

In the case of only having the URL with the product, how would I then enrich this data from some external source like a database or file? E.g. when I get the leather jacket URL, how would I poll the database/file to look up what category and type are associated with the leather jacket (jacket and men's clothing)?

Have a look at the translate filter.
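As a rough sketch (the dictionary path and mapping below are made up, and this assumes your grok has already extracted a product field), a translate filter can map the product slug to its category from a YAML file:

filter {
    translate {
        # Look the product slug up in a YAML dictionary and write the
        # result into a new "category" field; "unknown" if there is no entry.
        # The path and file contents here are illustrative assumptions.
        field           => "product"
        destination     => "category"
        dictionary_path => "/etc/logstash/product_categories.yml"
        fallback        => "unknown"
    }
}

where product_categories.yml would contain entries like:

"black-leather-jacket": "mens-clothing"

A second translate filter with another dictionary could handle the type lookup the same way.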

--- warkolm ---

Yeah I do. But I want to be able to match on both; how do I do that? They are two different log lines, from different user clicks.

With GET /product-category/mens-clothing/jacket/ I can see what categories etc. they are interested in, so the product kv pair would have an empty value.

With GET /product/black-leather-jacket/ I would need to enrich the logs by searching for what category etc. this product relates to and then adding that info to the log. In this case this would give me the category (men's clothing), type (jacket) and product (black leather jacket).

If that isn't clear let me know.

--- Magnus ---

Thanks, will do.

Can you provide your config? It'll give us better context.

Excuse the tardy reply, I had an assessment due.

I was writing up a long-winded question with my conf file and kind of figured out how to do the matches as I was asking it, so thanks.

I just need to work on the enrichment now with the database information.

Providing your solution may help others in the future :slight_smile:

Good point... it wasn't initially apparent how to do this.

Firstly, I have changed the webserver to list dummy financial products (as this is what I am stating I am doing in my thesis), so if the URLs look different from my original post, that's why.

So I have a webserver that logs URLs that I want to parse in Logstash so I can group them on a per-user basis in Kibana. I also want to strip a lot of the crap out of the log file (see the mutate at the end of the filter below).

An example log would be:

<IP removed> - - [07/Apr/2016:12:41:38 +0000] "GET /product-category/banking/transaction-account/ HTTP/1.1" 200 6777 http://<IP removed>/product-category/banking/transaction-account/ wp-settings-1=libraryContent%3Dbrowse; wp-settings-time-1=1460008031; wordpress_test_cookie=WP+Cookie+check; wordpress_logged_in_cf8749745210721969771dfabf2df0ea=nathansturgess.pt%7C1460181231%7Czd5rFlzqVwPlbmfKEzJvzVhKeRoNLVfOJPbQNiHAZTs%7858084f4596e3a6863df44a88fbb59Cdc2d2145c62c3c6083519c1cfb8cf47ff2

I want to match on the Apache log data, the username in the cookie, and on URLs in the following formats:

/product-category/<category>/, for example /product-category/banking/
/product-category/<category>/<subcategory>/, for example /product-category/banking/transaction-account/
/product/<product>/, for example /product/everyday-account/

To do this matching I have the following filter snippet from my logstash.conf file:

filter {
    grok {
        # Parse the standard Apache fields, then capture the username
        # from the wordpress_logged_in cookie (the value between the
        # last "=" on the line and the following "%").
        match => { "message" => "\A%{COMMONAPACHELOG}(.*=(?<username>[^%]*)%)" }
    }
    grok {
        # Category from /product-category/<category>/...
        match => [ "request", "/product-category\/(?<category>.*?)\/" ]
    }
    grok {
        # Subcategory from /product-category/<category>/<subcategory>/
        match => [ "request", "/product-category\/.*?\/(?<subcategory>.*?)\/" ]
    }
    grok {
        # Product from /product/<product>/
        match => [ "request", "/product\/(?<product>.*?)\/" ]
    }
    mutate {
        # Drop fields I don't need in Elasticsearch.
        remove_field => ["ident", "auth", "source", "hostname", "name", "host", "beat", "input_type"]
    }
}

Here is a screenshot of a couple of the resulting lines in Kibana.

I then plan to enrich the lines that have only the product with the category and subcategory, and the lines that have a subcategory with the category. I still need to figure out how to do this.
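Based on the translate filter suggestion above, something along these lines might work; this is only a sketch, not a tested config, and the dictionary paths and contents are assumptions:

filter {
    if [product] and ![category] {
        translate {
            # Map the product slug to its category,
            # e.g. "everyday-account" => "banking".
            field           => "product"
            destination     => "category"
            dictionary_path => "/etc/logstash/product_to_category.yml"
        }
        translate {
            # Second lookup for the subcategory,
            # e.g. "everyday-account" => "transaction-account".
            field           => "product"
            destination     => "subcategory"
            dictionary_path => "/etc/logstash/product_to_subcategory.yml"
        }
    }
}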

I'd expect the pipeline to be Webserver->Filebeat->Kafka->Logstash->Elasticsearch

Only the nightly Beats builds for 5.0 support shipping direct to Kafka.

Good to know. I have only looked at master, so go figure.
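For reference, the Logstash-to-Kafka leg of the original pipeline is just a kafka output. A minimal sketch, assuming a local broker and a made-up topic name:

output {
    kafka {
        # Broker address and topic name are placeholders.
        bootstrap_servers => "localhost:9092"
        topic_id          => "weblogs"
        codec             => json
    }
}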

NOTE: No one pulled me up on this, but I figured out that the correct way to do the grok on the request is with a single one-line alternation regex (with the subcategory as an optional group, so both levels of a category URL are captured in one pass):

filter {
    grok {
        # Apache fields plus the username from the cookie, as before.
        match => { "message" => "\A%{COMMONAPACHELOG}(.*=(?<username>[^%]*)%)" }
    }
    grok {
        # One pass over the request: either /product-category/<category>/
        # with an optional /<subcategory>/, or /product/<product>/.
        # Making the subcategory an optional group (instead of a separate
        # alternative) means a two-level URL fills both fields.
        match => [ "request", "(/product-category\/(?<category>[^/]+)\/((?<subcategory>[^/]+)\/)?)|(/product\/(?<product>[^/]+)\/)" ]
    }
    mutate {
        remove_field => ["ident", "auth", "source", "hostname", "name", "host", "beat", "input_type"]
    }
}
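With this single grok, a request like /product-category/banking/transaction-account/ comes out with category=banking and subcategory=transaction-account, while /product/everyday-account/ comes out with just product=everyday-account, ready for the translate enrichment above.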