Searchable list of jobs sorted by user relevance

I have a list of jobs, queryable by the user, that I want to be able to
sort by relevance to the logged-in user. There are roughly 100,000 jobs
and even more users. How should I approach indexing this? Is attaching all
user relevance data for a specific job to that job's document a good idea?

I am also considering using persistent storage mode for the relevance data
and storing it only in ES. Would that affect the optimal setup?

--

How exactly are you representing the relevance data?

I would go about this by doing something like: each user has a list of
keywords/phrases that represent their interests. This is stored alongside
their user profile in your database. Then you could just construct a match
or mlt query out of those words/phrases and query the jobs index with it.
Any additional criteria the user applies at search time could just be added
as filters.
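
A minimal sketch of that idea, assuming the jobs have a text field called
"description" and that the user's interests are already loaded from the
database. The field name, the helper name and the filter fields are made up
for illustration, and the bool/filter form shown here is the query DSL of
later Elasticsearch versions, so adjust it to whatever version you run:

```python
def build_jobs_query(interests, filters=None, size=20):
    """Build a search body that scores jobs against a user's interest phrases."""
    # "description" is a hypothetical text field on the job documents.
    should = [{"match": {"description": phrase}} for phrase in interests]
    bool_query = {"should": should, "minimum_should_match": 1}
    if filters:
        # Filters restrict the result set without affecting the score.
        bool_query["filter"] = [
            {"term": {field: value}} for field, value in sorted(filters.items())
        ]
    return {"size": size, "query": {"bool": bool_query}}


body = build_jobs_query(["python", "data engineering"], {"city": "stockholm"})
```

The resulting dict is what you would send as the search request body; jobs
matching more of the interest phrases score higher.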

Or have you already pre-calculated job<->user relevance using a batch
process somewhere else? In that case you probably don't need Elasticsearch
for this part :) A row in a key-value store with an ordered list of job
IDs for each user would do, right?
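
Roughly like this, assuming the batch process emits (job_id, user_id,
relevance) rows. A plain dict stands in for the key-value store here, and
the function name is made up:

```python
from collections import defaultdict


def ranked_jobs_by_user(rows):
    """Group (job_id, user_id, relevance) rows into per-user job ID lists, best first."""
    scored = defaultdict(list)
    for job_id, user_id, relevance in rows:
        scored[user_id].append((relevance, job_id))
    # Sort by descending relevance; ties are broken by ascending job ID.
    return {
        user_id: [job_id for _, job_id in sorted(pairs, key=lambda p: (-p[0], p[1]))]
        for user_id, pairs in scored.items()
    }


rows = [(101, 1, 0.2), (102, 1, 0.9), (103, 1, 0.5), (101, 2, 0.7)]
ranked = ranked_jobs_by_user(rows)
# ranked == {1: [102, 103, 101], 2: [101]}
```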

(In reply to Joakim Ekström's message of Saturday, 6 October 2012 22:49:16
UTC+1.)

--

Oh, sorry. The relevance data is precalculated and currently stored in
MySQL as rows of (job_id, user_id, relevance).

My question is strictly about how to index this optimally so that results
can be sorted by it.

(In reply to Andrew Clegg's message of Sunday, 7 October 2012 12:03:21
UTC+2.)

--

I'm still not sure why you need ES at all in this case. Can't you just get
a ranked list of jobs for that user from MySQL?

But if there's a reason why you need to combine relevance with an ES query
(e.g. for additional filtering), maybe the best way would be to have a
separate type called "relevance" with "user_id" and "score" fields, and set
the job type as its _parent.

Then do something like what is described in the "Ordering" section of "Fun
with elasticsearch's children and nested documents" (Space Vatican).
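
As a rough sketch of what such a query could look like (this is not the
query from that article): it assumes a child type "relevance" with
"user_id" and "score" fields, a hypothetical "description" text field on
the job, and the has_child / function_score syntax of later Elasticsearch
versions, so the exact parameter names may differ on yours:

```python
def jobs_sorted_by_user_relevance(user_id, text, size=20):
    """Search jobs by text and score them by the given user's child "relevance" doc."""
    child_score = {
        "function_score": {
            # Only consider the relevance document that belongs to this user.
            "query": {"term": {"user_id": user_id}},
            # Use the stored "score" field as the score of the child document.
            "field_value_factor": {"field": "score"},
            "boost_mode": "replace",
        }
    }
    return {
        "size": size,
        "query": {
            "bool": {
                # The child score becomes the parent's score via score_mode.
                "must": {
                    "has_child": {
                        "type": "relevance",
                        "score_mode": "max",
                        "query": child_score,
                    }
                },
                # The full-text condition only filters, so it does not change the order.
                "filter": {"match": {"description": text}},
            }
        },
    }


body = jobs_sorted_by_user_relevance(42, "backend developer")
```

Note that written this way, jobs with no relevance document for the user
drop out of the results entirely.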

(In reply to Joakim Ekström's message of Sunday, 7 October 2012 11:44:57
UTC+1.)

--

I use ES for full-text and faceted search.

Will look into that. Thanks.

(In reply to Andrew Clegg's message of Sunday, 7 October 2012 13:10:50
UTC+2.)

--

Actually, if you use ES as an IR engine instead of a filtering paradigm, it
makes sense to put one relevance metric in the corpus for use when you are
calculating relevance in your queries, though very few ES users talk about
relevance. You can leverage the external numbers concurrently with the
built-in similarity by writing a Lucene-level custom similarity object.
Lucene will load it for you, unless that has been neutered in ES. I've done
it multiple times in other custom APIs for Lucene.

Your large-cardinality relevance calculation could be stored several ways
and still be available to ES queries. Given the performance of ES I've seen
at numbers up to 200 M docs, I doubt a call to another persistent store
would kill your performance, as long as you batch it with the known doc IDs
for the query result.
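
The batched lookup could be sketched like this. An in-memory SQLite
database stands in for the real store, and the table name job_relevance
and the function name are made up; the (job_id, user_id, relevance)
columns are the ones mentioned earlier in the thread. Keep in mind that
this only reorders the job IDs you pass in, so to sort the whole result set
you need all matching IDs from ES, not just one page:

```python
import sqlite3


def sort_by_stored_relevance(conn, user_id, job_ids):
    """Order job IDs from a search result by this user's stored relevance, best first."""
    if not job_ids:
        return []
    placeholders = ",".join("?" for _ in job_ids)
    # One batched lookup for all known job IDs instead of one query per hit.
    rows = conn.execute(
        "SELECT job_id, relevance FROM job_relevance "
        f"WHERE user_id = ? AND job_id IN ({placeholders})",
        [user_id, *job_ids],
    ).fetchall()
    relevance = dict(rows)
    # Jobs without a stored value sort last (treated as relevance 0.0).
    return sorted(job_ids, key=lambda job_id: -relevance.get(job_id, 0.0))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_relevance (job_id INTEGER, user_id INTEGER, relevance REAL)")
conn.executemany(
    "INSERT INTO job_relevance VALUES (?, ?, ?)",
    [(101, 1, 0.2), (102, 1, 0.9), (103, 2, 0.8)],
)
ordered = sort_by_stored_relevance(conn, 1, [101, 102, 103])
# ordered == [102, 101, 103]
```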

re.

(In reply to Joakim Ekström's message of Sunday, October 7, 2012 7:27:34 AM
UTC-4.)

--