Storing and analyzing user agent strings, general approach


(Mark Dodwell) #1

I want to store a bunch of documents in Elasticsearch (each representing a
hit to a website), including the user agent of the client that made the
original HTTP request.

Since user agent strings have a lot of variance, and the useful parts need
parsing out (OS, browser, version etc.) I would like to be able to perform
aggregations on those extracted features.

The simplest way I can think to do this would be to analyze the user agent
string before indexing the document. The downside to this approach is that,
as new/different user agent strings emerge (which is not unlikely), you would
have to proactively update the parser.

This may be impossible/undesirable for a number of reasons, but what I'd
really like to do is index the raw user agent string and then perform the
analysis/feature extraction post-hoc at query time. Any ideas/pointers on
how to do this?

Aggregators? Custom analyzers? (How would you handle an update to the
analyzer, would you need to re-run against all existing stored data?)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ed9bf030-f9bf-480a-88b1-a80421b9e79e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Patrick Proniewski) #2

Hi,

You should give http://logstash.net/docs/1.4.2/filters/useragent a try before anything else.

Here is the relevant part of logstash.conf I'm using:

filter {
  if [type] == "apache" {
    if [user-agent] != "-" and [user-agent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "user-agent"
      }
    }
    if "UA" in [tags] {
      if [device] == "Other" { mutate { remove_field => "device" } }
      if [name] == "Other" { mutate { remove_field => "name" } }
      if [os] == "Other" { mutate { remove_field => "os" } }
    }
  }
}

It retains the full user-agent field, and adds nice fields like "device", "major" and "minor" version, "name", "os", "os_name", "os_major" and "os_minor".

sample:
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/6.1.4 Safari/537.76.4",
"name": "Safari",
"os": "Mac OS X 10.8.5",
"os_name": "Mac OS X",
"os_major": "10",
"os_minor": "8",
"major": "6",
"minor": "1",
"patch": "4",
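
For readers who want to see the idea outside Logstash, here is a minimal Python sketch of the same kind of enrichment: keep the raw "user-agent" string and add parsed fields beside it. It is not how the useragent filter is implemented. The function name `enrich` and the two regular expressions are assumptions made up for this illustration, and they only cover the Safari-on-Mac sample above.

```python
import re

# Toy patterns covering only the Safari-on-Mac sample above (assumption);
# a real parser relies on a much larger, regularly updated rule set.
SAFARI_RE = re.compile(r"Version/(\d+)\.(\d+)(?:\.(\d+))? Safari/")
MAC_RE = re.compile(r"Mac OS X (\d+)[_.](\d+)(?:[_.](\d+))?")


def enrich(event):
    """Return a copy of event with parsed fields added next to the raw string."""
    out = dict(event)
    ua = out.get("user-agent", "")
    if ua in ("", "-"):
        return out  # same guard as the conditional in the config above
    out["tags"] = list(out.get("tags", [])) + ["UA"]

    browser = SAFARI_RE.search(ua)
    if browser:
        out["name"] = "Safari"
        out["major"], out["minor"] = browser.group(1), browser.group(2)
        if browser.group(3):
            out["patch"] = browser.group(3)

    os_match = MAC_RE.search(ua)
    if os_match:
        parts = [p for p in os_match.groups() if p]
        out["os_name"] = "Mac OS X"
        out["os"] = "Mac OS X " + ".".join(parts)
        out["os_major"], out["os_minor"] = parts[0], parts[1]
    return out


sample = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) "
                  "AppleWebKit/537.76.4 (KHTML, like Gecko) "
                  "Version/6.1.4 Safari/537.76.4"
}
parsed = enrich(sample)
```

Because the raw string stays in the document, a function like this could in principle be run again over stored documents after the rules change; whether that is practical depends on how much data there is.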



(Mark Dodwell) #3

Thanks, lots of useful stuff there.



(Patrick Proniewski) #4

I just realized that the "user-agent" field comes from my Apache config, where I define a JSON logging format:

LogFormat "{ \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \"message\": \"%r\", \"host\": \"%v\", \"user-agent\": \"%{User-agent}i\", \"client\": \"%a\", \"duration_usec\": %D, \"duration_sec\": %T, \"status\": %s, \"size\": %B, \"request_path\": \"%U\", \"request\": \"%U%q\", \"method\": \"%m\", \"referrer\": \"%{Referer}i\" }" logstash_ext_json

Everything else is in my first email.
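
As a rough illustration of what a consumer of that log format sees, the Python sketch below reads one JSON log line and pulls out the "user-agent" value, skipping the "-" and empty placeholders the same way the Logstash conditional does. The function name `extract_user_agent` and the sample line are invented for this example. One caveat: Apache escapes non-printable characters in logged values as \xhh sequences, which are not valid JSON, so some lines may fail to parse; the sketch simply returns None for those.

```python
import json


def extract_user_agent(log_line):
    """Return the user agent from one JSON log line, or None if unusable."""
    try:
        event = json.loads(log_line)
    except json.JSONDecodeError:
        return None  # e.g. a line containing an Apache \xhh escape
    ua = event.get("user-agent", "")
    # "-" is what Apache logs when the request carried no User-Agent header
    return None if ua in ("", "-") else ua


# Hypothetical line in the shape produced by the LogFormat above.
line = (
    '{ "@timestamp": "2014-06-27T22:03:00+0200", "message": "GET / HTTP/1.1", '
    '"host": "example.org", '
    '"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5)", '
    '"client": "192.0.2.1", "duration_usec": 1200, "duration_sec": 0, '
    '"status": 200, "size": 512, "request_path": "/", "request": "/", '
    '"method": "GET", "referrer": "-" }'
)
user_agent = extract_user_agent(line)
```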


