The computation challenge here is if you have many unique users and you have a distributed data (eg employing time-based indices) where the foo and bar events can exist on different machines. The data locality is just not there for this sort of analysis. Building an entity centric index from your event index is often the best approach for any kind of behavioural analysis.
Yep, understood. But an additional question/comment.
For your newbie/hater/fanboy example, you have defined what makes a user a newbie, hater, or fanboy and created a second (entity centric) index classifying those users into one of those groups. That opens up queries where you can say "within fanboys, find me ..." or "within haters, find me..." (which is great).
That gets me part of the way there.
But, say I want to come up with a new classification, "all-star" (people who review a lot and give both good and bad reviews) but I'm not sure what best defines an "all-star". So, I want to explore my data. Maybe it's someone has made 100 reviews, with at least 10 reviews in each of the 1-to-5 star categories. Maybe it's some who has made 100 reviews with 20 reviews each in the 2-to-4 star categories. I just don't know.
I want to give my users the ability to do this type of ad-hoc data exploration. I'm indexing individual events and want to do AND conditions across events. Maybe one search will be for "users who did foo and bar" and another search would be for "users who did foo, bar, and baz".
One approach I can think of (which I feel would work, but smells a little) is to run each AND query separately, save each result set off and do an intersection between the result sets.
Agreed, the “fanboy” example bakes in some logic however you can store other attributes on an entity profile that can be raw ingredients you can assemble in scripts at query time to give you derived properties like your idea for an all-star rating. These attributes can be simple counts you can use in scripts to compute ratios on the fly for an entity.
To use a culinary analogy:
the “fanboy” entity property is a ready-meal
holding basic flags/counts on an entity is like a fridge stocked with useful ingredients and
only having an event-centric index is like just having a gun and a hunting knife
So varying levels of pre-preparation and effort involved in getting results.
Very insightful. Thanks. I won't be able to anticipate every type of query my users may want but I think with a secondary index of flags and counts like you suggested, most cases should be covered. For those that aren't, I can get creative.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.