What should the document structure be for clickstream data based on IP addresses?

ikarthikb · February 28, 2018, 1:14pm

I have clickstream data about my website. The events come in as the user clicks or interacts with the website. For example, a sample of the data looks like this. I have millions of data points like this coming from millions of different IP addresses.

<event url="https://bla.com/home" action="touch" name="touch on products button" ip="12.12.12.12" userAgent="user agent string" avgTime=9></event>
<event url="https://bla.com/shop" action="touch" name="touch on productId2 button" ip="12.12.12.12" userAgent="user agent string" avgTime=19></event>
<event url="https://bla.com/shop" action="touch" name="touch on productId2 button" ip="12.12.12.12" userAgent="user agent string" avgTime=19></event>
<event url="https://bla.com/purchase" action="touch" name="touch on purchase button" ip="12.12.12.12" userAgent="user agent string" avgTime=19></event>
<event url="https://bla.com/shop" action="touch" name="touch on products button" ip="22.22.12.12" userAgent="user agent string" avgTime=19></event>

I want to be able to query based on IP addresses. I want to be able to create charts that answer questions like "What is the average number of products that a user buys?" When I look at the visualization options in Kibana, I can create buckets based on IP addresses, but I cannot seem to apply filters or do complex analysis like my example above.

I assume this is because of the way I have stored my documents in ES. What is the best structure for clickstream data like the above that will help me ask complex queries like the one I have mentioned?

Thanks
K

dadoonet · February 28, 2018, 3:22pm

What exactly do the JSON documents look like?

ikarthikb · March 1, 2018, 1:41pm

Here is the full structure of the document:

{
  "_index": "logstash-2018.03.01",
  "_type": "doc",
  "_id": "586228704",
  "_score": 1,
  "_source": {
    "date": "Thu Mar 01 07:32:54 CST 2018",
    "Action": "login",
    "agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_6 like Mac OS X) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0 Mobile/15D100 Safari/604.1",
    "geoip": {
      "continent_name": "North America",
      "city_name": "Texas",
      "country_iso_code": "US",
      "region_name": "Austin",
      "location": {
        "lon": -72.7503,
        "lat": 32.4854
      }
    },
    "type": "someUseraction",
    "url": "https://myurl/somepath/bla",
    "Name": "touch on productid1",
    "@timestamp": "2017-11-01T13:32:54.000Z",
    "clientip": "12.12.12.12",
    "@version": "1",
    "responsetime": "0.002",
    "user_agent": {
      "major": "11",
      "minor": "0",
      "os": "iOS 11.2.6",
      "os_minor": "2",
      "os_major": "11",
      "name": "Mobile Safari",
      "os_name": "iOS",
      "device": "iPhone"
    }
  },
  "fields": {
    "@timestamp": [
      "2018-03-01T13:32:54.000Z"
    ]
  }
}

dadoonet · March 1, 2018, 2:37pm

"I assume this is because of the way I have stored my documents in ES. What is the best structure for clickstream data like the above that will help me ask complex queries like the one I have mentioned?"

I believe that the structure of the document you have so far is ok.

"When I look at the visualization options in Kibana, I can create buckets based on IP addresses, but I cannot seem to apply filters or do complex analysis like my example above."

So a terms agg on clientip does not give the results you would expect?
What kind of JSON response would you like to see?
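
For reference, a terms aggregation on the IP field can be sketched as a search request body like the one below. This is only an illustration: the aggregation name `by_ip` and the helper `terms_agg_body` are made up here, and whether the field to aggregate on is `clientip` or a keyword sub-field such as `clientip.keyword` depends on the index mapping, which the thread does not show.

```python
import json


def terms_agg_body(field="clientip", size=10):
    """Build a search body that buckets events by the values of one field.

    `field` is an assumption: depending on the mapping it may need to be
    a keyword sub-field such as "clientip.keyword".
    """
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "aggs": {
            "by_ip": {  # hypothetical aggregation name
                "terms": {"field": field, "size": size}  # top `size` buckets
            }
        },
    }


# Print the JSON that would be sent as the body of a _search request.
print(json.dumps(terms_agg_body(), indent=2))
```

The `size` inside `terms` caps how many buckets come back, which matters when there are millions of distinct IP addresses.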

ikarthikb · March 1, 2018, 3:04pm

When I bucket by IP address, I get millions of buckets (as there are many visitors). I want to be able to do analysis like "What is the average number of URLs visited by users before purchasing a product?" When bucketing by IP address, I am unable to perform such analysis, which is why I was wondering if I was storing the documents incorrectly.

dadoonet · March 8, 2018, 10:09pm

Maybe @jpountz has an idea on how to solve your use case.

loren · March 8, 2018, 10:54pm

I wonder if shifting to a user-centric or session-centric model (vs. event-centric) would help here. Watching the video, it seems like he's trying to answer the same sorts of questions.
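
To make the session-centric idea concrete, here is a rough sketch of rolling raw events up into one document per session before indexing. Everything specific in it is an assumption for illustration: the 30-minute inactivity cutoff, the `ts` field holding a parsed timestamp, using `clientip` as the user key, and treating any URL containing `/purchase` as a purchase.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff


def summarize(user, evs):
    """Collapse one session's time-ordered events into a session document."""
    urls_before_purchase = None
    seen = []
    for ev in evs:
        if "/purchase" in ev["url"]:  # assumed purchase marker
            urls_before_purchase = len(set(seen))  # distinct URLs seen so far
            break
        seen.append(ev["url"])
    return {
        "user": user,
        "start": evs[0]["ts"],
        "event_count": len(evs),
        "purchased": urls_before_purchase is not None,
        "urls_before_purchase": urls_before_purchase,
    }


def build_sessions(events, gap=SESSION_GAP):
    """Group events per user, then split into sessions on inactivity gaps."""
    by_user = defaultdict(list)
    for ev in events:
        by_user[ev["clientip"]].append(ev)  # assumed user key

    sessions = []
    for user, evs in by_user.items():
        evs.sort(key=lambda e: e["ts"])
        current = [evs[0]]
        for ev in evs[1:]:
            if ev["ts"] - current[-1]["ts"] > gap:
                sessions.append(summarize(user, current))
                current = []
            current.append(ev)
        sessions.append(summarize(user, current))
    return sessions


# Sample events modeled on the ones in the first post (timestamps invented).
t0 = datetime(2018, 3, 1, 13, 0)
events = [
    {"clientip": "12.12.12.12", "url": "https://bla.com/home", "ts": t0},
    {"clientip": "12.12.12.12", "url": "https://bla.com/shop", "ts": t0 + timedelta(seconds=20)},
    {"clientip": "12.12.12.12", "url": "https://bla.com/shop", "ts": t0 + timedelta(seconds=40)},
    {"clientip": "12.12.12.12", "url": "https://bla.com/purchase", "ts": t0 + timedelta(seconds=60)},
    {"clientip": "22.22.12.12", "url": "https://bla.com/shop", "ts": t0},
]
sessions = build_sessions(events)
purchases = [s["urls_before_purchase"] for s in sessions if s["purchased"]]
print(sum(purchases) / len(purchases))  # 2.0
```

With one document per session, a question like "average number of URLs visited before purchasing" becomes a plain average over the `urls_before_purchase` field of sessions where `purchased` is true, instead of a per-IP bucket analysis over raw events.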

Also, consider using a hash of the IP + user agent as a heuristic for identifying a unique anonymous "user". IPs will clump up around proxies, but exact user-agent duplicates are rarer.
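
A minimal sketch of that heuristic, assuming SHA-256 and a `|` separator (neither is specified in the thread, and the function name `anonymous_user_id` is made up for illustration):

```python
import hashlib


def anonymous_user_id(ip, user_agent):
    """Derive a stable pseudo-user key from the IP and user agent string."""
    # The separator keeps ("1.2.3.4", "5x") distinct from ("1.2.3.45", "x").
    raw = f"{ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


# Two visitors behind the same proxy IP but with different browsers
# end up with different keys.
a = anonymous_user_id("12.12.12.12", "Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_6 like Mac OS X)")
b = anonymous_user_id("12.12.12.12", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
print(a != b)  # True
```

The resulting key could be stored on each event at ingest time and used in place of the raw IP for bucketing. It remains a heuristic: two people with the same IP and identical browsers still collapse into one "user".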
