I have clickstream data about my website. The events come in as the user clicks on or interacts with the website. For example, a sample of the data looks like this. I have millions of data points like this coming from millions of different IP addresses.
I want to be able to query based on IP addresses. I want to create charts that answer questions such as "What is the average number of products that a user buys?" When I look at the visualization options in Kibana, I can create buckets based on IP addresses, but I cannot seem to apply filters or do complex analysis like my example above.
I assume this is because of the way I have stored my documents in ES. What is the best document structure for clickstream data like the above, one that would let me ask complex queries like the one I mentioned?
I believe that the structure of the document you have so far is ok.
When I look at the visualization options in Kibana, I can create buckets based on IP addresses, but I cannot seem to apply filters or do complex analysis like my example above.
So a terms agg on clientip does not give the results you would expect?
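For reference, a terms agg on clientip, combined with a sibling `avg_bucket` pipeline aggregation, could look roughly like the request body below. This is only a sketch: `clientip` comes from this thread, but the `action` field and its `purchase` value are assumptions about the mapping, and the helper name is made up. Note that the average is computed only over the buckets the terms agg returns, so with millions of IPs it covers just the top `size` of them.

```python
import json

# Hypothetical helper: builds a search body that counts purchase events per
# client IP and averages those counts across the returned buckets.
# "clientip" is the field named in this thread; "action" is an assumed field.
def purchases_per_ip_body(top_n=1000):
    return {
        "size": 0,  # only aggregation results are wanted, not hits
        "query": {"term": {"action": "purchase"}},
        "aggs": {
            # One bucket per IP address, limited to the top_n largest buckets.
            "by_ip": {"terms": {"field": "clientip", "size": top_n}},
            # Sibling pipeline agg: average of the per-IP document counts.
            "avg_purchases_per_ip": {
                "avg_bucket": {"buckets_path": "by_ip>_count"}
            },
        },
    }

body = purchases_per_ip_body(top_n=1000)
print(json.dumps(body, indent=2))
```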
What kind of JSON response would you like to see?
When I bucket by IP address, I get millions of buckets (as there are many visitors). I want to be able to do analysis like "What is the average number of URLs visited by users before purchasing a product?" When bucketing by IP address, I am unable to perform such analysis, which is why I was wondering whether I was storing the documents incorrectly.
I wonder if shifting to a user-centric or session-centric model (vs. event-centric) would help here. Watching the video, it seems like he's trying to answer the same sorts of questions.
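To make the user-centric idea concrete, here is a minimal sketch of rolling raw events up into one summary value per user and then averaging, done in plain Python outside of ES. The field names `timestamp`, `url` and `action` are assumptions about your documents (only `clientip` is taken from this thread), and users who never purchase are simply left out of the average.

```python
from collections import defaultdict
from statistics import mean

def urls_before_first_purchase(events):
    """Return {user: number of distinct URLs seen before the first purchase}."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event["clientip"]].append(event)

    summary = {}
    for user, user_events in by_user.items():
        user_events.sort(key=lambda e: e["timestamp"])
        seen = set()
        for event in user_events:
            if event["action"] == "purchase":
                summary[user] = len(seen)  # stop at the first purchase
                break
            seen.add(event["url"])
    return summary

# Made-up sample events, purely for illustration.
events = [
    {"clientip": "10.0.0.1", "timestamp": 1, "url": "/home", "action": "view"},
    {"clientip": "10.0.0.1", "timestamp": 2, "url": "/item/1", "action": "view"},
    {"clientip": "10.0.0.1", "timestamp": 3, "url": "/checkout", "action": "purchase"},
    {"clientip": "10.0.0.2", "timestamp": 1, "url": "/home", "action": "view"},
    {"clientip": "10.0.0.2", "timestamp": 2, "url": "/checkout", "action": "purchase"},
    {"clientip": "10.0.0.3", "timestamp": 1, "url": "/home", "action": "view"},
]

per_user = urls_before_first_purchase(events)
average = mean(per_user.values())
print(per_user, average)
```

If each per-user summary were indexed as its own document, a question like this one becomes a plain `avg` over a numeric field rather than an aggregation over millions of IP buckets.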
Also, consider using a hash of the IP + User-Agent as a heuristic for identifying a unique anonymous "user". IPs will clump up around proxies, but exact User-Agent duplicates are rarer.
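A minimal sketch of that heuristic, assuming you have the client IP and the raw User-Agent string available at index time. The function name, the `|` separator and the choice of SHA-256 are illustrative choices, not anything prescribed by ES.

```python
import hashlib

def visitor_key(client_ip, user_agent):
    """Derive a stable pseudo-identifier for an anonymous visitor."""
    # The separator keeps the IP and User-Agent from running together.
    raw = f"{client_ip}|{user_agent}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

# Two visitors behind the same proxy IP but with different browsers
# end up with different keys.
key_a = visitor_key("203.0.113.7", "Mozilla/5.0 (X11; Linux x86_64)")
key_b = visitor_key("203.0.113.7", "Mozilla/5.0 (Macintosh)")
print(key_a != key_b)
```

Storing this key on every event document would give you a field to bucket or roll up on instead of the raw IP.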