Graph query across multiple documents

(Alessandro Negro) #1

Hi,
I'm trying to figure out how I can perform the following case study:

I have the MovieLens dataset, which has an entry for each rating, and I'm storing the data in Elasticsearch the same way. I have a document for each rating "event" with the details about the user, the movie, the rating, and the time. I would like to create a graph between movies based on the number of co-occurrences, i.e. by counting how many users have seen both of them.

It seems that in order to create a graph I need to create a single or multivalued field on a user document with the list of movies seen.

Thanks in advance,
Alessandro


(Mark Harwood) #2

Correct - that would be my suggestion. In our graphs, the weighting between vertices (in this case movies) is derived on the fly from many documents that contain multiple terms.
The advantage of having a doc per user is that we use the search engine to relevance-rank a set of users whose tastes most closely align with your (potentially many) movie choices and then base recommendations on what appear as "uncommonly common" movie choices in that set of users. We can also apply diversity settings on this sample for gender, country or age etc.
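
To illustrate why a doc per user makes connections between movies easy to derive, here is a minimal sketch in plain Python. It only counts raw co-occurrences (how many users have both movies in their list); it does not reproduce the relevance ranking or significance scoring described above. The `movies` field name and the sample docs are assumptions made up for the illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical user-centric docs: one doc per user with a multivalued "movies" field.
user_docs = [
    {"user": "u1", "movies": ["Alien", "Aliens", "Zombieland"]},
    {"user": "u2", "movies": ["Alien", "Aliens"]},
    {"user": "u3", "movies": ["Aliens", "Zombieland"]},
]


def cooccurrence_counts(docs):
    """Count, for every unordered movie pair, how many docs contain both movies."""
    counts = Counter()
    for doc in docs:
        # Deduplicate and sort so each unordered pair maps to a single key.
        for pair in combinations(sorted(set(doc["movies"])), 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrence_counts(user_docs)
for (movie_a, movie_b), n in sorted(pair_counts.items()):
    print(f"{movie_a} <-> {movie_b}: {n} users")
```

With event-based docs the same count would need a join across documents, which is why the per-user field matters.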


(Alessandro Negro) #3

Thanks Mark for your answer. I supposed so. Now, supposing that I can't change my data model, what I'm trying to do is write an aggregation query that "merges" the data into a new index (maybe I'll use Logstash for creating/updating the new index). Then I'll run the graph query on this new index. Does that make sense? I'm still not sure I can do this "concat" aggregation.

Btw, is there any plan to support this type of graph query?

Thanks again,
Alessandro


(Mark Harwood) #4

Bulk exports should be done using the scan/scroll API rather than as part of a single search request using aggregations. This re-orientation of raw event-based data into entity-based data (e.g. user profiles) is a common approach for a variety of reasons, and I have a talk with links to scripts that can aid this process here: https://youtu.be/yBf7oeJKH2Y?t=5m32s
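
As a rough sketch of that re-orientation (not the scripts from the talk), the pivot from rating events to user profiles can be expressed in a few lines of Python. The field names `user_id`, `movie_id`, `rating` and `timestamp` and the sample events are assumptions; in practice the events would be streamed out of the source index with a scroll and the profiles bulk-indexed into a new one.

```python
from collections import defaultdict

# Hypothetical event-based docs: one per rating, as in the original data model.
events = [
    {"user_id": 1, "movie_id": 10, "rating": 4.0, "timestamp": 1000},
    {"user_id": 2, "movie_id": 10, "rating": 3.5, "timestamp": 1010},
    {"user_id": 1, "movie_id": 20, "rating": 5.0, "timestamp": 1020},
    {"user_id": 1, "movie_id": 10, "rating": 4.5, "timestamp": 1030},  # re-rating
]


def build_user_profiles(rating_events):
    """Pivot event docs into one entity doc per user with a multivalued movies field."""
    movies_by_user = defaultdict(set)
    for event in rating_events:
        # A set keeps each movie once per user, even if it was rated twice.
        movies_by_user[event["user_id"]].add(event["movie_id"])
    return [
        {"user_id": user_id, "movies": sorted(movies)}
        for user_id, movies in sorted(movies_by_user.items())
    ]


user_profiles = build_user_profiles(events)
for profile in user_profiles:
    print(profile)
```

Holding every user's set in memory is fine for a dataset the size of MovieLens; larger exports would need to be sorted by user or processed in batches.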

We can usefully do graph queries on event-centric data such as click logs where each doc has a user's search term and the product they clicked on. In this case the vertices (product codes and searches) are both interesting to return in the form of a graph. We demoed this using BestBuy data in the recent graph webinar.

However, for the event-based docs you have (userX-watched-movieY) the individual users are not interesting vertices to return in a graph. For tiny data volumes on toy datasets they may serve to link related movies into clusters but in real applications individual users are just visual clutter and their many actions over-link so need aggregating and filtering to produce a useful "wisdom of crowds" type result. This is what the Graph API is tuned for by default. We consider the actions of many users and distill the connections down to just the statistically significant associations.
When we say "many users" of course we don't mean every single user who happens to share any one of your likes (say "Star Wars") because that would cast the net too wide and dilute any signal. We need a way of ranking the users who are most similar to you and selecting the closest ones for analysis. That selection process is exactly what search engines do out of the box, with ranking heuristics such as TF-IDF and norms, when we work with user-centric docs.
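
For readers who want to see what such a request might look like, below is a hedged sketch of a Graph explore request body built as a Python dict. The `movies` field and the seed titles are assumptions; the `controls`, `vertices` and `connections` sections follow the documented Graph explore API, but the exact endpoint path has changed between releases, so check the docs for your version before sending it.

```python
import json

# Hypothetical seed: the movies one user likes, stored in a "movies" field.
seed_movies = ["Shaun of the Dead", "Zombieland"]

graph_request = {
    # Scored "should" clauses rank the user docs by how well they match the seed.
    "query": {
        "bool": {
            "should": [{"term": {"movies": title}} for title in seed_movies]
        }
    },
    "controls": {
        "use_significance": True,  # keep only statistically significant terms
        "sample_size": 100,        # analyse only the best-matching docs per shard
    },
    # Movies that are significant within the sample of matching users.
    "vertices": [{"field": "movies", "size": 5}],
    # One further hop: movies connected to those first-hop movies.
    "connections": {"vertices": [{"field": "movies", "size": 5}]},
}

request_json = json.dumps(graph_request, indent=2)
print(request_json)
```

The sample size is what keeps the net from being cast too wide: only the top-ranked user docs feed the significance calculation.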


(Alessandro Negro) #5

Thanks a lot for your replies. I'll test moving from event-based data to entity-based data.


(Mark Harwood) #6

For the record - quick example using the movielens data on a user-centric index:

And the IMDB suggestions for the same Talladega nights movie:


#7

Mark Harwood wrote: "In our graphs, the weighting between vertices (in this case movies) is derived on the fly from many documents that contain multiple terms."
Mark, suppose a list that contains the genres of the movies watched by a user is added to this record. When determining the weight of the connection between genres, does the cardinality of a genre in that list influence the connection's weight somehow or are these lists treated as sets (i.e. having watched 50 horror movies is treated the same as having watched only one)?


(Mark Harwood) #8

I personally wouldn't consider adding genres as a linking device - you can but they are an example of something which is too broad and typically not nuanced enough.

I wouldn't seek to find the strength of connections between genres like "Action" and "Drama" (although I could). It wouldn't tell me much of interest about the world. It's much more informative to examine the behaviours of less-frequent items such as specific movies.

If I was recommending movies to you I would take a set of your movie IDs as a query and search to find the top ~100 people whose tastes most closely match yours. Universally popular movies like "The Silence of the Lambs" wouldn't count highly as a close relationship, but your rarer horror choices like "Shaun of the Dead" and "Zombieland" would pull in fans of a perhaps as-yet unnamed sub-genre (in this case RomComZom). Examining these people's lists of movies would likely reveal that 60% of them like "Star Wars", but that is "commonly common" and so not of interest. However, we might find that 15 of the 100 closest people like the movie "Tucker and Dale vs Evil". That's 15% of the set, but only 20 people in the whole dataset of 1 million like that movie, so this change in popularity (0.002% -> 15%) is "uncommonly common" and likely to be of interest.
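
The arithmetic behind the "Tucker and Dale vs Evil" example can be checked with a short sketch. It uses a plain ratio of foreground to background popularity to make the point; that is a simplification for illustration and not the significance heuristic Elasticsearch actually applies.

```python
def percentage(count, total):
    """Share of `total` represented by `count`, as a percentage."""
    return 100.0 * count / total


# Numbers taken from the example above.
sample_likes, sample_size = 15, 100           # likes among the 100 closest users
dataset_likes, dataset_size = 20, 1_000_000   # likes in the whole dataset

foreground_pct = percentage(sample_likes, sample_size)
background_pct = percentage(dataset_likes, dataset_size)

# How many times more popular the movie is in the sample than overall,
# computed from the raw counts to avoid rounding noise.
uplift = (sample_likes * dataset_size) / (sample_size * dataset_likes)

print(f"background: {background_pct:.3f}%  foreground: {foreground_pct:.0f}%")
print(f"uplift: {uplift:.0f}x")
```

A movie liked by 60% of the sample would only score a high uplift if it were rare in the dataset as a whole, which is not the case for a universally popular title.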

Hopefully this helps show why the individual movie IDs are of more use than broad genres.


#9

Thanks. Although the genre information I have available is quite detailed, I get your point that more interesting insights can be derived from linking titles directly.

