How to correctly parse and then enrich log data in Logstash?

I am working on my thesis and I have this setup: Webserver->Filebeat->Logstash->Kafka->ELK Server

I have logs that contain the product info in the GET request from the webserver, depending on what was being served. An example would be:
GET /product-category/mens-clothing/jacket/ (then if you clicked on the black leather jacket you would get the next line)
GET /product/black-leather-jacket/

I want to be able to extract the category (men's clothing), type (jacket), and product (black leather jacket) from the log data so I can do some ML and look at what particular customers are doing. I have a regex to match each variation, but I don't know how to handle both.

Hence my question, how do I:

  1. Match both
  2. In the case of only having the URL with the product, how would I then enrich this data from some external source like a database or file? E.g. when I get the leather jacket URL, how would I poll the database/file to look up what category and type are associated with the leather jacket (jacket and men's clothing)?

Do you have a pattern for the log already?
If so, you should be able to split the URL field into parts.

In the case of only having the URL with the product, how would I then enrich this data from some external source like a database or file? E.g. when I get the leather jacket URL, how would I poll the database/file to look up what category and type are associated with the leather jacket (jacket and men's clothing)?

Have a look at the translate filter.
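As a rough sketch (the dictionary path and mapping below are made up, and this assumes your grok has already extracted a product field), a translate filter can map the product slug to its category from a YAML file:

filter {
    translate {
        # Look the product slug up in a YAML dictionary and write the
        # result into a new "category" field; "unknown" if there is no entry.
        # The path and file contents here are illustrative assumptions.
        field           => "product"
        destination     => "category"
        dictionary_path => "/etc/logstash/product_categories.yml"
        fallback        => "unknown"
    }
}

where product_categories.yml would contain entries like:

"black-leather-jacket": "mens-clothing"

A second translate filter with another dictionary could handle the type lookup the same way.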

--- warkolm ---

Yeah I do. But I want to be able to match on both; how do I do that? They are two different log lines, from different user clicks.

With GET /product-category/mens-clothing/jacket/ I can see what categories etc. they are interested in, so the product kv pair would have an empty value.

With GET /product/black-leather-jacket/ I would need to enrich the logs by searching for what category etc. this product relates to and then adding that info to the log. In this case this would give me the category (men's clothing), type (jacket) and product (black leather jacket).

If that isn't clear let me know.

--- Magnus ---

Thanks, will do.

Can you provide your config? It'll give us better context.

Excuse the tardy reply, I had an assessment due.

I was writing up a long-winded question with my conf file and kind of figured out how to do the matches as I was asking it, so thanks.

I just need to work on the enrichment now with the database information.

Providing your solution may help others in the future :slight_smile:

Good point... it wasn't initially apparent how to do this.

Firstly, I have changed the webserver to list dummy financial products (as this is what I am stating I am doing in my thesis), so if the URLs look different from my original post, that's why.

So I have a webserver that logs URLs that I want to parse in Logstash so I can group them on a per-user basis in Kibana. I also want to strip a lot of the crap out of the log file (see the mutate at the end of the filter below).

An example log would be:

<IP removed> - - [07/Apr/2016:12:41:38 +0000] "GET /product-category/banking/transaction-account/ HTTP/1.1" 200 6777 http://<IP removed>/product-category/banking/transaction-account/ wp-settings-1=libraryContent%3Dbrowse; wp-settings-time-1=1460008031; wordpress_test_cookie=WP+Cookie+check; wordpress_logged_in_cf8749745210721969771dfabf2df0ea=nathansturgess.pt%7C1460181231%7Czd5rFlzqVwPlbmfKEzJvzVhKeRoNLVfOJPbQNiHAZTs%7858084f4596e3a6863df44a88fbb59Cdc2d2145c62c3c6083519c1cfb8cf47ff2

I want to match on the Apache log data, the username in the cookie, and on URLs in the following formats:

/product-category/<category>/, for example /product-category/banking/
/product-category/<category>/<subcategory>/, for example /product-category/banking/transaction-account/
/product/<product>/, for example /product/everyday-account/

To do this matching I have the following filter snippet from my logstash.conf file:

filter {
    grok {
        # Parse the standard Apache fields, then capture the username
        # from the wordpress_logged_in cookie (the value between the
        # last "=" on the line and the following "%").
        match => { "message" => "\A%{COMMONAPACHELOG}(.*=(?<username>[^%]*)%)" }
    }
    grok {
        # Category from /product-category/<category>/...
        match => [ "request", "/product-category\/(?<category>.*?)\/" ]
    }
    grok {
        # Subcategory from /product-category/<category>/<subcategory>/
        match => [ "request", "/product-category\/.*?\/(?<subcategory>.*?)\/" ]
    }
    grok {
        # Product from /product/<product>/
        match => [ "request", "/product\/(?<product>.*?)\/" ]
    }
    mutate {
        # Drop fields I don't need in Elasticsearch.
        remove_field => ["ident", "auth", "source", "hostname", "name", "host", "beat", "input_type"]
    }
}

Here is a screenshot of a couple of the resulting lines in Kibana.

I then plan to enrich the lines that have only the product with the category and subcategory, and the lines that have a subcategory with the category. I still need to figure out how to do this.
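Based on the translate filter suggestion above, something along these lines might work; this is only a sketch, not a tested config, and the dictionary paths and contents are assumptions:

filter {
    if [product] and ![category] {
        translate {
            # Map the product slug to its category,
            # e.g. "everyday-account" => "banking".
            field           => "product"
            destination     => "category"
            dictionary_path => "/etc/logstash/product_to_category.yml"
        }
        translate {
            # Second lookup for the subcategory,
            # e.g. "everyday-account" => "transaction-account".
            field           => "product"
            destination     => "subcategory"
            dictionary_path => "/etc/logstash/product_to_subcategory.yml"
        }
    }
}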

I'd expect the pipeline to be Webserver->Filebeat->Kafka->Logstash->Elasticsearch

Only the nightly Beats builds for 5.0 support shipping direct to Kafka.

Good to know. I have only looked at master, so go figure.
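For reference, the Logstash-to-Kafka leg of the original pipeline is just a kafka output. A minimal sketch, assuming a local broker and a made-up topic name:

output {
    kafka {
        # Broker address and topic name are placeholders.
        bootstrap_servers => "localhost:9092"
        topic_id          => "weblogs"
        codec             => json
    }
}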

NOTE: No one pulled me up on this, but I figured out that the correct way to do the grok on the request is with a single one-line alternation regex (with the subcategory as an optional group, so both levels of a category URL are captured in one pass):

filter {
    grok {
        # Apache fields plus the username from the cookie, as before.
        match => { "message" => "\A%{COMMONAPACHELOG}(.*=(?<username>[^%]*)%)" }
    }
    grok {
        # One pass over the request: either /product-category/<category>/
        # with an optional /<subcategory>/, or /product/<product>/.
        # Making the subcategory an optional group (instead of a separate
        # alternative) means a two-level URL fills both fields.
        match => [ "request", "(/product-category\/(?<category>[^/]+)\/((?<subcategory>[^/]+)\/)?)|(/product\/(?<product>[^/]+)\/)" ]
    }
    mutate {
        remove_field => ["ident", "auth", "source", "hostname", "name", "host", "beat", "input_type"]
    }
}
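With this single grok, a request like /product-category/banking/transaction-account/ comes out with category=banking and subcategory=transaction-account, while /product/everyday-account/ comes out with just product=everyday-account, ready for the translate enrichment above.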