How to do a self-join type of query

(Thomas Chung) #1

This is kinda a followup question to this. I'm now indexing individual events and each event has a user field. Say I have these three docs:

{ "user":"abc", "event":"foo" }
{ "user":"abc", "event":"bar" }
{ "user":"def", "event":"foo" }

Say I want all users that have done both "foo" and "bar". In this case, I'd only be interested in "abc". In SQL, I would do a self-join like:

select e1.user 
from event e1, event e2
where e1.event = 'foo' and e2.event = 'bar'
and e1.user = e2.user

I searched the forum and a similar question was asked (with no answers) here.

Again, thanks in advance.

(Mark Harwood) #2

The computation challenge here is if you have many unique users and you have a distributed data (eg employing time-based indices) where the foo and bar events can exist on different machines. The data locality is just not there for this sort of analysis. Building an entity centric index from your event index is often the best approach for any kind of behavioural analysis.

(Thomas Chung) #3

Yep, understood. But an additional question/comment.

For your newbie/hater/fanboy example, you have defined what makes a user a newbie, hater, or fanboy and created a second (entity centric) index classifying those users into one of those groups. That opens up queries where you can say "within fanboys, find me ..." or "within haters, find me..." (which is great).

That gets me part of the way there.

But, say I want to come up with a new classification, "all-star" (people who review a lot and give both good and bad reviews) but I'm not sure what best defines an "all-star". So, I want to explore my data. Maybe it's someone has made 100 reviews, with at least 10 reviews in each of the 1-to-5 star categories. Maybe it's some who has made 100 reviews with 20 reviews each in the 2-to-4 star categories. I just don't know.

I want to give my users the ability to do this type of ad-hoc data exploration. I'm indexing individual events and want to do AND conditions across events. Maybe one search will be for "users who did foo and bar" and another search would be for "users who did foo, bar, and baz".

One approach I can think of (which I feel would work, but smells a little) is to run each AND query separately, save each result set off and do an intersection between the result sets.

Again, appreciate all your help.

(Mark Harwood) #4

Agreed, the “fanboy” example bakes in some logic however you can store other attributes on an entity profile that can be raw ingredients you can assemble in scripts at query time to give you derived properties like your idea for an all-star rating. These attributes can be simple counts you can use in scripts to compute ratios on the fly for an entity.
To use a culinary analogy:

  1. the “fanboy” entity property is a ready-meal
  2. holding basic flags/counts on an entity is like a fridge stocked with useful ingredients and
  3. only having an event-centric index is like just having a gun and a hunting knife

So varying levels of pre-preparation and effort involved in getting results.

(Thomas Chung) #5

Very insightful. Thanks. I won't be able to anticipate every type of query my users may want but I think with a secondary index of flags and counts like you suggested, most cases should be covered. For those that aren't, I can get creative.

(system) closed #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.