Using Elasticsearch with per-user ordering of results

I'm looking at using Elasticsearch for our application. We do quite a lot
of filtering on numerical attributes (for example, items have category_id
and feature_ids properties, and many searches are filters requiring a
certain category_id and a number of feature_ids).

The number of indexed documents is relatively small: tens of thousands,
possibly low hundreds of thousands. Each user of the system has a certain
number (10-20k) of documents recommended to them, with a score given by a
recommendation algorithm I've implemented. The ordering is different for
each user. What I'd like to be able to do is present results ordered by
that score, but use Elasticsearch to filter by category, date, feature_ids
and so on.
Searches will always have at least one numerical filter that should limit
the maximum number of hits to 10-20k, and more usually under 10k.

I've come up with 3 approaches so far:

  • calculating the recommendation score on the fly via custom_score. With a
    simplified version of the scoring algorithm this was taking 2-3s per
    search, which isn't fast enough

  • precalculating recommendation scores (which we already do), and passing a
    map of {document_id => score} as the parameter to a custom_score script.
    That script is then very simple - basically
    params['score_map'][doc['id'].stringValue]
    This does work but doesn't give me a very nice feeling. Each search request
    is pretty chunky because we have to keep sending the map of scores over
    (see the sketch after this list). I naively expected that this would get
    faster when more selective filters made the result set smaller, but that
    doesn't seem to be the case. On my development machine these searches seem
    to take ~150-200ms

  • nested documents. With this approach each document would have a number of
    nested documents of the form

{
  "user_id": 1234,
  "score": 4567
}
I could then use a has_child filter to select the documents that are
recommended for the user.
I haven't yet implemented this strategy. I'm not sure, however, how I would
sort documents by score in this strategy - would I have to use a nested
query with a custom_score rather than a has_child filter?
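
For reference, here is a rough sketch of what the second approach looks like from the Java client, assuming the 0.19-era custom_score query builder; the index name, field names and client setup are made up for illustration and are not from the original post:

import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class ScoreMapSearch {

    // Filter by category, then have the custom_score script replace the relevance
    // score with the precomputed recommendation score looked up in the params map.
    public static SearchResponse search(Client client, int categoryId, Map<String, Float> scoreMap) {
        return client.prepareSearch("items")
                .setQuery(QueryBuilders.customScoreQuery(
                                QueryBuilders.filteredQuery(
                                        QueryBuilders.matchAllQuery(),
                                        FilterBuilders.termFilter("category_id", categoryId)))
                        .script("params['score_map'][doc['id'].stringValue]")
                        .param("score_map", scoreMap))
                .setSize(30)
                .execute()
                .actionGet();
    }
}

The downside described in the second bullet is visible here: the whole score map has to be serialized into every request, regardless of how selective the filter is.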

The other wrinkle is that users' scores can of course change, and those
changes need to be reflected quickly: if a user edits their preferences to
say that they hate the colour red but their recommendations still show lots
of red things, they get frustrated. New users also need to be able to get
search results quickly. This is straightforward with the first 2
approaches, but less so for the third. I've had a look at the update API,
but I'm not sure whether I could efficiently update/remove all the nested
documents for a single user, or what performance impact this might have.
The documentation hints that the nested documents are stored/indexed
separately, but it seems like this is an implementation detail that I can't
access directly.

Has anyone else tried something similar, or have some words of wisdom on
what approaches might be effective?

Thanks,

Fred

  • Have you tried implementing the custom_score as a "native" script (using Java)? It will speed things up. This would be the simplest option.
  • The second option, does that mean that you send a map of 20k docId->score for each search request?
  • Nested probably won't help, since you can't sort on nested doc values...

On 6 Mar 2012, at 20:23, Shay Banon <kimchy@gmail.com> wrote:

  • Have you tried to implement the custom_score as a "native" script (using Java). It will speed things up. This would be the simplest option.

Not yet. Will give it a go. Evaluating a score is relatively simple - it's effectively the dot product of a vector of per-garment floats with a vector of per-user floats (about 100 or so); there's a tiny sketch of that calculation below. Is Java likely to be much faster than MVEL at this sort of task?

  • The second option, does that mean that you send a map of 20k docId->score for each search request?

Yes. Yuck. I can probably cut down the map size based on filters in some cases but not all. The data for each user's map is stored as a single mongo document so I might try loading the map inside a native script rather than loading it in my app and shuttling it across to Elasticsearch.

  • Nested probably won't help, since you can't sort on nested doc values...

Won't head any further down that dead end, then. Thanks.
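
For concreteness, the per-document calculation being discussed is essentially the following plain-Java sketch; the vector layout and names are illustrative, not Fred's actual code:

// Dot product of a per-garment feature vector and a per-user preference vector,
// both assumed to be the same length (around 100 floats).
public final class RecommendationScore {
    public static float score(float[] garmentFeatures, float[] userPreferences) {
        float total = 0f;
        for (int i = 0; i < garmentFeatures.length; i++) {
            total += garmentFeatures[i] * userPreferences[i];
        }
        return total;
    }
}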

On Tuesday, March 6, 2012 at 11:16 PM, Frederick Cheung wrote:

On 6 Mar 2012, at 20:23, Shay Banon <kimchy@gmail.com> wrote:

  • Have you tried to implement the custom_score as a "native" script (using Java). It will speed things up. This would be the simplest option.

Not yet. Will give it a go. Evaluating a score is relatively simple - it's effectively the dot product of a vector of per garment floats with a vector of per user floats (about a 100 or so). Is java likely to be much faster than mvel at this sort of task?
It will be faster, yes. How much faster depends on the type of script and calculations you do (hard to answer).

  • The second option, does that mean that you send a map of 20k docId->score for each search request?

Yes. Yuck. I can probably cut down the map size based on filters in some cases but not all. The data for each user's map is stored as a single mongo document so I might try loading the map inside a native script rather than loading it in my app and shuttling it across to Elasticsearch.
Still, nice that it takes 150-200ms even when you send all those parameters :). With the Java implementation, you could have the script cache the mentioned map (at the factory level, which is a singleton at the node level). You could potentially build something that throws a failure if the map for a specific user is not there; you can catch that failure (with a specific message) and then know that you need to send the map to reinitialize the cache. That way, you don't have to send the map each time (a bare-bones sketch of this idea follows below).

  • Nested probably won't help, since you can't sort on nested doc values...

Won't head any more down that dead-end then. Thanks.
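
For reference, a bare-bones sketch of the caching contract described above, deliberately stripped of the actual Elasticsearch native-script plumbing (the class below is not part of the ES API; a native script factory would simply delegate to something like it). The factory-level cache is node-wide, the script looks scores up per document, and a missing map surfaces as a recognisable error so the client knows to resend it:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class UserScoreCache {

    // userId -> (documentId -> recommendation score), shared by every script invocation on the node.
    private static final ConcurrentHashMap<Integer, Map<String, Float>> CACHE =
            new ConcurrentHashMap<Integer, Map<String, Float>>();

    // Called when a search request carries a score_map parameter, to (re)populate the cache.
    public static void put(int userId, Map<String, Float> scoreMap) {
        CACHE.put(userId, scoreMap);
    }

    // Called per scored document. The distinctive message lets the client catch the failure,
    // resend the map once, and retry the search.
    public static float score(int userId, String documentId) {
        Map<String, Float> scores = CACHE.get(userId);
        if (scores == null) {
            throw new IllegalStateException("score_map_missing:" + userId);
        }
        Float score = scores.get(documentId);
        return score == null ? 0.0f : score;
    }
}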

I've been going down that route and it is a lot, lot faster than passing the parameters each time - it hadn't occurred to me that so much of the elapsed time was spent handling the parameters. With a warm cache (which should be the common case - users very rarely view just one page) I'm getting results in around 15ms, which is more than fast enough (and way faster & more flexible than our previous home-grown, non-Elasticsearch solution).

Thanks.
Fred

Can you share the code you wrote so other users can benefit from it? It would make a great sample.

Sure, will do tomorrow. Been a few years since I've written any Java, so probably not the prettiest Java you've ever seen!

Sent from my iPhone

I've put it in a gist at https://gist.github.com/dd996f7b4a5529162199

The production code has some extra knobs to do with testing whether the cached data is stale with respect to what's in Mongo, and a few other bits and bobs, but I believe the gist demonstrates the general workings. Apologies again for my rusty Java!

Fred

Ahh, you query Mongo right from the script if you need to populate the cache, cool! One small optimization option: Elasticsearch has Trove embedded in it (native-type data structures), so you can use it to create the map from int (id) -> BasicDBObject. Also, I would suggest just using a collection that does int->int, and not storing the Mongo DBObject (extra memory).

Another note: you use a LinkedHashMap to do the caching. Note that it's not thread-safe by default. You can wrap it in a synchronized map, but you could also use Google Guava's CacheBuilder (also embedded in Elasticsearch) to build a cache that evicts based on the number of entries and is still concurrent.
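
For reference, a rough sketch of the CacheBuilder suggestion. It uses the stock com.google.common packages and a boxed Integer map for brevity; inside an Elasticsearch plugin you would use the shaded org.elasticsearch.common copies mentioned above, and a Trove int->int map instead. The Mongo loader is a placeholder:

import java.util.Map;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class ScoreCacheExample {

    // Bounded, concurrent cache of userId -> (itemId -> score). The loader runs at most once
    // per missing key, so concurrent searches for the same user no longer all hit MongoDB.
    private final LoadingCache<Integer, Map<Integer, Integer>> scoresByUser =
            CacheBuilder.newBuilder()
                    .maximumSize(10000)                        // evict the least recently used entries
                    .expireAfterWrite(30, TimeUnit.MINUTES)    // illustrative staleness bound
                    .build(new CacheLoader<Integer, Map<Integer, Integer>>() {
                        @Override
                        public Map<Integer, Integer> load(Integer userId) throws Exception {
                            return loadScoresFromMongo(userId);
                        }
                    });

    public Map<Integer, Integer> scoresFor(int userId) throws Exception {
        return scoresByUser.get(userId);
    }

    // Placeholder: the real script would fetch the user's score document from MongoDB here.
    private Map<Integer, Integer> loadScoresFromMongo(int userId) {
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}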

Thanks for the tips. Other than my ignorance of more concurrency-friendly cache structures, the reason for putting a synchronized(this) {} around fetching/populating the cache is that otherwise Mongo was actually getting hit once for each shard, as they all rushed to try and populate the cache.

Fred

Ahh, yeah - check the Cache constructs in Google Guava; they allow you to plug in logic that loads the value for a key "only once". See more here: http://docs.guava-libraries.googlecode.com/git-history/v11.0.2/javadoc/com/google/common/cache/CacheBuilder.html (the Google Guava cache is embedded in ES under org.elasticsearch.common.cache).

Implemented both of your tips this morning. Changing to an int->int map has probably shaved off a few per cent, and the Google Guava cache is way nicer than my botch job and handles all my needs (including refreshes - I do optimistic refreshes by querying Mongo for recommendation_score_maps.find(profile_id: 1234, updated_at: {$gt: last_updated_at}), sketched below; if I get no records back then I know my cached data is still fresh and don't need to rebuild the int->int map).

Thanks for the tips!

Fred
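
For reference, that optimistic staleness check looks roughly like this with the 2.x MongoDB Java driver; the collection and field names follow the query quoted in the message above, and the surrounding cache plumbing is assumed:

import java.util.Date;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

public class ScoreMapFreshness {

    // True if a score-map document for this profile has been rewritten since we last cached it,
    // i.e. one exists with updated_at greater than our cached timestamp.
    public static boolean isStale(DBCollection scoreMaps, int profileId, Date lastUpdatedAt) {
        DBObject query = new BasicDBObject("profile_id", profileId)
                .append("updated_at", new BasicDBObject("$gt", lastUpdatedAt));
        return scoreMaps.findOne(query) != null;
    }
}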

On Thursday, March 8, 2012 at 12:40 AM, Frederick Cheung wrote:

On 7 Mar 2012, at 21:02, Shay Banon wrote:

Can you share the code you did so other users will benefit from it? It would be a great sample.

I've put it in a gist at gist:dd996f7b4a5529162199 · GitHub

The production code has some extra knobs on do with testing whether the cached data is stale with respect to what's in mongo and a few other bits and bobs but I believe the gist demonstrates the general workings. Apologies again for my rusty java!

Fred

On Wednesday, March 7, 2012 at 5:49 PM, Frederick Cheung wrote:

On 7 Mar 2012, at 11:24, Shay Banon wrote:

  • The second option, does that mean that you send a map of 20k docId->score for each search request?

Yes. Yuck. I can probably cut down the map size based on filters in some cases but not all. The data for each user's map is stored as a single mongo document so I might try loading the map inside a native script rather than loading it in my app and shuttling it across to Elasticsearch.
Still nice that it takes 150-200ms when you send all those parameters :). With the Java implementation, you could have the script cache the mentioned map (on the factory level, which is singleton in the node level). You could potentially build something that throws a failure if the map for a specific user is not there, and you can catch that failure (with a specific message), and then know that you need to send the map to reinitialize the cache. That way, you don't have to send the map each time

I've been going down that route and it is a lot, lot faster than passing the parameters each time - it didn't occur to me that so much of the elapsed time was spent handling the parameters. With a warm cache (which should occur relatively often - users very rarely view just one page) I'm getting results in around 15ms which is more than fast enough (and way faster & more flexible than our previous home grown non Elasticsearch solution).

Thanks.
Fred

.

  • Nested will probably won't help, since you can't sort on nested doc values...

Won't head any more down that dead-end then. Thanks.



Apologies for reviving an old thread.

I have a setup where I require per-user ordering much like Frederick
describes. We have a score table in MySQL, with a score per document per
user, and we would like to sort search results based on the specific user's
scores.
As an alternative to the third option Frederick describes (nested
documents), we've considered adding a field per user to each document with
the name "score{user_id}", e.g. "_score_5327" for the user with id 5327.
Quering elasticsearch on behalf of that user, then requires specifying
"sort": { "_score_5327": { "order": "desc" } }. By keeping the per-user
score on the root document, we sidestep the problem of not being able to
sort on nested document fields. We can keep the scores up-to-date with the
partial update API.
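
Concretely, the two operations would look something like the following with the Java client (just a sketch: the "items" index/type names, the wrapper methods and a 0.90-ish client API are assumptions, not our actual code):

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.sort.SortOrder;

import java.io.IOException;

public class PerUserScoreExample {

    // Partial update: set (or overwrite) just this user's score field on one document
    public static void setScore(Client client, String documentId, int userId, double score) throws IOException {
        client.prepareUpdate("items", "item", documentId)
                .setDoc(jsonBuilder().startObject().field("_score_" + userId, score).endObject())
                .execute().actionGet();
    }

    // Filter as usual, but sort on that user's score field
    public static SearchResponse searchForUser(Client client, int userId, int categoryId) {
        return client.prepareSearch("items")
                .setQuery(QueryBuilders.termQuery("category_id", categoryId))
                .addSort("_score_" + userId, SortOrder.DESC)
                .execute().actionGet();
    }
}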

Are there any apparent problems with this approach? Can ES handle the many
extra fields (~3,000 scores for the most popular document)? How does it
compare to writing a plugin, as suggested above?

Thanks,
Andreas

On Friday, 9 March 2012 11:33:59 UTC+1, Frederick Cheung wrote:

On 8 Mar 2012, at 21:56, Shay Banon wrote:

Ahh, yea, check the Cache constructs in Google Guava, they allow you to plug
in logic that loads the value for a key "only once". See more here:
http://docs.guava-libraries.googlecode.com/git-history/v11.0.2/javadoc/com/google/common/cache/CacheBuilder.html.
(The Google Guava cache is embedded in ES under org.elasticsearch.common.cache.)

