Social graph distance based scoring using ES


We want to implement a personalized user search using ES. The idea is that we have a graph in which nodes represent users and edges represent trust, on top of that we want to let users search for other users such that the result set of users will be scored according to the distance in the graph from the user that issued the query.

For example, if we have the following trust graph (let's assume a unidirectional friendship model):
A->B->C->D (A trusts B, B trusts C, C trusts D)
A->E->F (A trusts E, E trusts F)

When user A issues a query (let's assume some query on some field) and both F and D matches the query with the same "stock score", since F is closer to A in the graph, F should be generally ranked higher for user A and should appear higher in the search results.

We are looking for how to organize the data so that we can support such queries and in general any ideas of how to go about this if even feasible using ES.


Sounds like what you want is “shortest path” type logic where it’s a weighted graph being used to measure distance.
The awkward questions are :
What happens if F and A are 60 hops apart and require the whole graph to be sucked into RAM?
What do you do about the “super connectors” found in many graphs - nodes that join up many others? What filters do you apply?

It’s a potentially expensive task and tools like neo4j optimise for this use case by converting IDs like F and A to pointers on insert and insist on using only one machine with a lot of RAM.

Yes, I am interested in search where the shortest path has a significant effect on the score. It might be possible to limit the number of hops to something like 5 in my case, but still 5th level could be a lot (even 3 level, look at linkedin). I've been looking at graph based databases line neo4j, titan (janusgrapg), orientdb, etc. The missing piece there is full fledged search.

The x-pack extension to elastic stack has a Graph api but it is primarily designed to play to our strengths (using relevance ranking to prioritise discovery of the strongest connections first)
It’s not designed for shortest path analysis because, as you suggest, 5 hops could be the whole dataset and that doesn’t work well in a distributed system. Neo4j and others should allow for a certain level of search using nodes with indexed properties and have been optimised for shortest path analysis by keeping all data on one machine and using pointer-based links

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.