Extract top visited websites from logs


(Hayder Abbass) #1

Hello,

We are storing log data containing information about sites that have been visited from our network. I'd like to query the top 10 visited websites. How can I achieve this with Elasticsearch? The index mapping is as follows:

{
    "data" : {
        "properties" : {
            "date": {
              "type" : "date",
              "format" : "yyyy-MM-dd HH:mm:ss"
            },
          "status": {"type" : "string"},
          "group": {"type" : "string"},
          "ip": {"type" : "ip"},
          "username":{"type" : "string"},
          "category":{"type" : "string"},
          "url":{"type" : "string"}
        }
    }
}

Sample Data:

"hits": {
      "total": 7,
      "max_score": 1,
      "hits": [
         {
            "_index": "squid",
            "_type": "data",
            "_id": "AU_DT4_ibdcNyAnt753J",
            "_score": 1,
            "_source": {
               "date": "2015-08-16T00:02:00.195Z",
               "status": "PASS",
               "group": "level3",
               "ip": "10.249.10.49",
               "username": "Hyder",
               "category": "ads",
               "url": "https://gmail.com/mail/u/0/#inbox"
            }
         },
         {
            "_index": "squid",
            "_type": "data",
            "_id": "AU_DMjDpbdcNyAnt75iB",
            "_score": 1,
            "_source": {
               "date": "2015-08-15T00:01:00.195Z",
               "status": "BLOCK",
               "group": "level3",
               "ip": "10.249.10.51",
               "username": "Fary",
               "category": "ads",
               "url": "https://gmail.com/details/blabla"
            }
         },
         {
            "_index": "squid",
            "_type": "data",
            "_id": "AU_DT94kbdcNyAnt753Y",
            "_score": 1,
            "_source": {
               "date": "2015-08-17T00:02:00.195Z",
               "status": "PASS",
               "group": "level3",
               "ip": "10.249.10.49",
               "username": "Hyder",
               "category": "news",
               "url": "http://aol.com"
            }
         },
         {
            "_index": "squid",
            "_type": "data",
            "_id": "AU_CwTEqbdcNyAnt74RJ",
            "_score": 1,
            "_source": {
               "date": "2015-08-15T00:00:00.195Z",
               "status": "PASS",
               "group": "level3",
               "ip": "10.249.10.49",
               "username": "Hyder",
               "category": "Blog",
               "url": "http://gmail.com"
            }
         },
         {
            "_index": "squid",
            "_type": "data",
            "_id": "AU_DMmUQbdcNyAnt75iQ",
            "_score": 1,
            "_source": {
               "date": "2015-08-15T00:02:00.195Z",
               "status": "PASS",
               "group": "level3",
               "ip": "10.249.10.51",
               "username": "Fary",
               "category": "ads",
               "url": "http://yahoo.com/vbfhghfgjfdgfd"
            }
         },
         {
            "_index": "squid",
            "_type": "data",
            "_id": "AU_DT1yjbdcNyAnt753B",
            "_score": 1,
            "_source": {
               "date": "2015-08-16T00:02:00.195Z",
               "status": "REDIR",
               "group": "level3",
               "ip": "10.249.10.49",
               "username": "Hyder",
               "category": "ads",
               "url": "http://news.yahoo.com/"
            }
         },
         {
            "_index": "squid",
            "_type": "data",
            "_id": "AU_DMV1ObdcNyAnt75hd",
            "_score": 1,
            "_source": {
               "date": "2015-08-15T00:01:00.195Z",
               "status": "BLOCK",
               "group": "level3",
               "ip": "10.249.10.50",
               "username": "Kamal",
               "category": "Blog",
               "url": "http://hotmail.com/dfdgfgfdg"
            }
         }
      ]
   }
What I'd like to have:

Top visited sites:

Sites - Hits
gmail.com - 3
yahoo.com - 2
aol.com - 1
hotmail.com - 1


(Mark Walkom) #2

I'd suggest that you install Kibana 4, build a visualisation that matches what you want to know, and then grab the query that is sent to ES to build it.

That's probably the easiest way to figure out how to build an agg that I have found.


(Mark Harwood) #3

Three options here:

a) On insertion - your application code that creates the JSON parses out the domain field as a separate JSON value
b) On indexing - use a "pattern" analyzer to extract the domain from the url
c) At query time - use a script to parse the domain name from URL values

Of these the best choice is option a).
Option b) relies on some funky regex and is notoriously hard to get right - see [1]
Option c) is slower and more of a security risk.
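For option a), here is a minimal sketch in Python of what the pre-processing step might look like (the `add_domain` helper and field name are just an illustration, not part of the original mapping). Note it keeps the full host, so news.yahoo.com stays separate from yahoo.com; collapsing subdomains to a registered domain would need extra logic:

```python
from urllib.parse import urlparse

def add_domain(doc):
    """Copy the host out of the url into its own field
    before the document is sent to Elasticsearch."""
    doc["domain"] = urlparse(doc["url"]).netloc
    return doc

doc = {"url": "https://gmail.com/mail/u/0/#inbox", "username": "Hyder"}
add_domain(doc)
# doc["domain"] is now "gmail.com", so a plain terms aggregation
# on "domain" gives the top-visited sites directly
```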

I had a crack at option b) but my regex skills are not up to debugging why it didn't work (I generally try sitting on the keyboard until the right series of strange characters emerges). Here it is anyway:
Mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "domain": {
          "type": "pattern",
          "pattern": "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])"
        }
      }
    }
  },
  "mappings": {
    "visit": {
      "properties": {
        "url": {
          "type": "string",
          "index": "no",
          "fields": {
            "domain": {
              "type": "string",
              "analyzer": "domain",
              "index": "analyzed"
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Query:

GET test/_search
{
  "aggs": {
    "match": {
      "terms": {
        "field": "url.domain",
        "size": 10
      }
    }
  }
}
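One possible reason the analyzer above misbehaves: a "pattern" analyzer uses its regex as the token separator, so text that matches the pattern is split away rather than kept as a token. If that is the culprit, a pattern_capture token filter (which emits the capture groups themselves) may be the better fit. Either way, the _analyze API is the quickest way to see what tokens actually come out (a sketch, assuming the test index above exists):

```
GET /test/_analyze?analyzer=domain&text=https://gmail.com/mail/u/0/
```

If the response shows the path fragments instead of the domain, the pattern is being used as a splitter and a pattern_capture filter is worth trying.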

The example for option c) below works, but it will be slower and relies on scripting being enabled; on the plus side, it avoids the regex hell:

GET test/_search
{
  "aggs": {
    "match": {
      "terms": {
        "script": "new URI(doc['url.raw'].value).getHost()"
      }
    }
  }
}

[1] http://stackoverflow.com/questions/863297/regular-expression-to-retrieve-domain-tld
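One caveat for option c): on recent 1.x releases, dynamic (inline) scripting is disabled by default for security reasons, so a setting along these lines is needed in elasticsearch.yml before the script aggregation will run (the exact setting name varies by version, so check the docs for yours):

```
# elasticsearch.yml - enable inline scripts (1.6+ syntax; older
# releases used script.disable_dynamic: false instead)
script.inline: on
```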


(Nimit Kalal) #4

I have the same problem, but my data is coming from Logstash, so I have to go for option c). What I want to know is what type of script I have to write.


(Mark Walkom) #6