How to do a self-join type of query

tchung · January 16, 2019, 12:08am

This is kind of a follow-up to an earlier question. I'm now indexing individual events, and each event has a user field. Say I have these three docs:

{ "user":"abc", "event":"foo" }

{ "user":"abc", "event":"bar" }

{ "user":"def", "event":"foo" }

Say I want all users that have done both "foo" and "bar". In this case, I'd only be interested in "abc". In SQL, I would do a self-join like:

select e1.user 
from event e1, event e2
where e1.event = 'foo' and e2.event = 'bar'
and e1.user = e2.user

I searched the forum and a similar question was asked (with no answers) here.

Again, thanks in advance.

Mark_Harwood · January 16, 2019, 9:54am

The computational challenge here arises when you have many unique users and distributed data (e.g. time-based indices), so the foo and bar events for one user can live on different machines. The data locality is just not there for this sort of analysis. Building an entity-centric index from your event index is often the best approach for any kind of behavioural analysis.
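To make the entity-centric idea concrete, here is a minimal in-memory sketch in Python. It is not Elasticsearch code: the `events` list, `build_entity_index`, and `users_with_all` are hypothetical names for illustration, and the plain dictionary stands in for the second index. The point is only that once events are folded into one profile per user, "did foo AND bar" becomes a filter on a single document.

```python
from collections import Counter, defaultdict

# Hypothetical event documents, mirroring the three examples in the question.
events = [
    {"user": "abc", "event": "foo"},
    {"user": "abc", "event": "bar"},
    {"user": "def", "event": "foo"},
]

def build_entity_index(events):
    """Fold event docs into one profile per user: event name -> count."""
    profiles = defaultdict(Counter)
    for doc in events:
        profiles[doc["user"]][doc["event"]] += 1
    return profiles

def users_with_all(profiles, required):
    """Return (sorted) users whose profile has every event in `required`."""
    return sorted(
        user
        for user, counts in profiles.items()
        if all(counts[name] > 0 for name in required)
    )

profiles = build_entity_index(events)
print(users_with_all(profiles, ["foo", "bar"]))  # ['abc']
```

In a real deployment the profiles would be written to their own index and kept up to date as new events arrive; how that is done is outside this sketch.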

tchung · January 16, 2019, 7:00pm

Yep, understood. But an additional question/comment.

For your newbie/hater/fanboy example, you have defined what makes a user a newbie, hater, or fanboy and created a second (entity centric) index classifying those users into one of those groups. That opens up queries where you can say "within fanboys, find me ..." or "within haters, find me..." (which is great).

That gets me part of the way there.

But say I want to come up with a new classification, "all-star" (people who review a lot and give both good and bad reviews), but I'm not sure what best defines an "all-star". So I want to explore my data. Maybe it's someone who has made 100 reviews, with at least 10 reviews in each of the 1-to-5 star categories. Maybe it's someone who has made 100 reviews with 20 reviews each in the 2-to-4 star categories. I just don't know.

I want to give my users the ability to do this type of ad-hoc data exploration. I'm indexing individual events and want to do AND conditions across events. Maybe one search will be for "users who did foo and bar" and another search would be for "users who did foo, bar, and baz".

One approach I can think of (which I feel would work, but smells a little) is to run each AND query separately, save each result set off and do an intersection between the result sets.
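A rough Python sketch of that intersection approach, for illustration only. The `users_for_event` helper is a hypothetical stand-in for "run one query and collect the matching users"; here it just scans an in-memory list rather than talking to a cluster.

```python
# Hypothetical event documents; "baz" is added to show a three-way AND.
events = [
    {"user": "abc", "event": "foo"},
    {"user": "abc", "event": "bar"},
    {"user": "abc", "event": "baz"},
    {"user": "def", "event": "foo"},
    {"user": "def", "event": "bar"},
]

def users_for_event(events, name):
    """Stand-in for one query: the set of users with an event called `name`."""
    return {doc["user"] for doc in events if doc["event"] == name}

def users_with_all_events(events, names):
    """Run one 'query' per event name, then intersect the result sets."""
    result_sets = [users_for_event(events, name) for name in names]
    if not result_sets:
        return set()
    return set.intersection(*result_sets)

print(sorted(users_with_all_events(events, ["foo", "bar"])))         # ['abc', 'def']
print(sorted(users_with_all_events(events, ["foo", "bar", "baz"])))  # ['abc']
```

The cost is that every per-event result set has to be pulled back to the client before intersecting, which gets expensive with many unique users.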

Again, appreciate all your help.

Mark_Harwood · January 16, 2019, 7:59pm

Agreed, the “fanboy” example bakes in some logic. However, you can store other attributes on an entity profile as raw ingredients to assemble in scripts at query time, giving you derived properties like your idea for an all-star rating. These attributes can be simple counts that scripts use to compute ratios on the fly for an entity.
To use a culinary analogy:

the “fanboy” entity property is a ready-meal
holding basic flags/counts on an entity is like a fridge stocked with useful ingredients and
only having an event-centric index is like just having a gun and a hunting knife

So there are varying levels of preparation and effort involved in getting results.
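As a sketch of the "stocked fridge" option: if each entity profile stores raw per-star review counts, competing "all-star" definitions can be tried at query time without reindexing. The Python below is an in-memory illustration, not an Elasticsearch script; the `profiles` data and both predicate functions are hypothetical, and "has made 100 reviews" is read here as "at least 100", which is an assumption.

```python
# Hypothetical entity profiles: per-user review counts keyed by star rating.
profiles = {
    "abc": {1: 12, 2: 25, 3: 30, 4: 22, 5: 11},
    "def": {1: 0, 2: 1, 3: 4, 4: 40, 5: 55},
    "ghi": {1: 5, 2: 30, 3: 30, 4: 30, 5: 5},
}

def all_star_v1(counts):
    """>= 100 reviews, with at least 10 in each of the 1-to-5 star categories."""
    total = sum(counts.values())
    return total >= 100 and all(counts.get(star, 0) >= 10 for star in range(1, 6))

def all_star_v2(counts):
    """>= 100 reviews, with at least 20 in each of the 2-to-4 star categories."""
    total = sum(counts.values())
    return total >= 100 and all(counts.get(star, 0) >= 20 for star in range(2, 5))

def classify(profiles, predicate):
    """Return (sorted) users whose stored counts satisfy `predicate`."""
    return sorted(user for user, counts in profiles.items() if predicate(counts))

print(classify(profiles, all_star_v1))  # ['abc']
print(classify(profiles, all_star_v2))  # ['abc', 'ghi']
```

Swapping the predicate is all it takes to explore a different definition, which is the kind of ad-hoc exploration asked about earlier in the thread.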

tchung · January 17, 2019, 6:56pm

Very insightful. Thanks. I won't be able to anticipate every type of query my users may want but I think with a secondary index of flags and counts like you suggested, most cases should be covered. For those that aren't, I can get creative.
