I'm looking at using Elasticsearch for our application. We do quite a lot of filtering on numerical attributes (for example, items have category_id and feature_ids properties, and many searches are filters requiring a certain category_id and a number of feature_ids).
The number of indexed documents is relatively small: tens of thousands of documents, possibly low hundreds of thousands. Each user of the system has a certain number (10-20k) of documents recommended to them, with a score given by a recommendation algorithm I've implemented, and the ordering is different for each user. What I'd like to be able to do is present results ordered by that score while using Elasticsearch to filter by category, date, feature_ids and so on.
Searches will always have at least one numerical filter that should limit the maximum number of hits to 10-20k, and more usually to under 10k.
I've come up with 3 approaches so far:

1. Calculating the recommendation score on the fly via custom_score. With a simplified version of the scoring algorithm this was taking 2-3s per search, which isn't fast enough.

2. Precalculating recommendation scores (which we already do) and passing a map of {document_id => score} as a parameter to a custom_score script. The script is then very simple - basically params['score_map'][doc['id'].stringValue]. This does work but doesn't give me a very nice feeling: each search request is pretty chunky because we have to keep sending the map of scores over. I naively expected this to get faster when more selective filters made the result set smaller, but that doesn't seem to be the case. On my development machine these searches take ~150-200ms.

3. Nested documents. With this approach each document would have a number of nested documents of the form

{
  "user_id": 1234,
  "score": 4567
}

I could then use a has_child filter to select the documents that are recommended for the user. I haven't implemented this strategy yet. I'm not sure, however, how I would sort documents by score in this strategy - would I have to use a nested query with a custom score rather than a has_child filter?
The other wrinkle is that users' scores can of course change, and those changes need to be reflected quickly: if a user edits their preferences to say that they hate the colour red but their recommendations still show lots of red things, they get frustrated. New users also need to be able to get search results quickly. This is straightforward with the first 2 approaches, but less so for the third. I've had a look at the API for updates but I'm not sure whether I could efficiently update/remove all the nested documents for a single user, or what performance impact that might have. The documentation hints that the nested documents are stored/indexed separately, but this seems to be an implementation detail that I can't access directly.
Has anyone else tried something similar, or have some words of wisdom on what approaches might be effective?
Have you tried to implement the custom_score as a "native" script (using Java)? It will speed things up. This would be the simplest option.
The second option, does that mean that you send a map of 20k docId->score for each search request?
Nested probably won't help, since you can't sort on nested doc values...
> Have you tried to implement the custom_score as a "native" script (using Java)? It will speed things up. This would be the simplest option.
Not yet. Will give it a go. Evaluating a score is relatively simple - it's effectively the dot product of a vector of per-garment floats with a vector of per-user floats (about 100 or so). Is Java likely to be much faster than MVEL at this sort of task?
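(For illustration only, a plain-Java sketch of the calculation being described - the class and method names are made up, and this is not the production scoring code:)

// Sketch of the arithmetic described above: a dot product of a per-garment float
// vector with a per-user float vector of roughly 100 entries. In a native Java
// script this is a tight loop over primitive arrays; MVEL evaluates the same
// expression interpretively with boxed numbers, which tends to be where the
// overhead comes from.
public final class RecommendationScore {

    public static float dotProduct(float[] garmentFeatures, float[] userWeights) {
        if (garmentFeatures.length != userWeights.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        float score = 0.0f;
        for (int i = 0; i < garmentFeatures.length; i++) {
            score += garmentFeatures[i] * userWeights[i];
        }
        return score;
    }
}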
> The second option, does that mean that you send a map of 20k docId->score for each search request?
Yes. Yuck. I can probably cut down the map size based on filters in some cases but not all. The data for each user's map is stored as a single mongo document so I might try loading the map inside a native script rather than loading it in my app and shuttling it across to Elasticsearch.
> Nested probably won't help, since you can't sort on nested doc values...
Won't head any further down that dead end, then. Thanks.
> Not yet. Will give it a go. Evaluating a score is relatively simple - it's effectively the dot product of a vector of per-garment floats with a vector of per-user floats (about 100 or so). Is Java likely to be much faster than MVEL at this sort of task?
It will be faster, yes. How much faster depends on the type of script and calculations you do (hard to answer).
> Yes. Yuck. I can probably cut down the map size based on filters in some cases but not all. The data for each user's map is stored as a single mongo document so I might try loading the map inside a native script rather than loading it in my app and shuttling it across to Elasticsearch.
Still, nice that it takes 150-200ms when you send all those parameters :). With the Java implementation, you could have the script cache the mentioned map (at the factory level, which is a singleton at the node level). You could potentially build something that throws a failure if the map for a specific user is not there; you can catch that failure (with a specific message) and then know that you need to resend the map to reinitialize the cache. That way, you don't have to send the map each time.
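A minimal sketch of that idea in plain Java (all names here are illustrative; in a real native script this state would live on the script factory, which Elasticsearch creates once per node):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a node-level cache of per-user score maps, plus the "fail if missing"
// protocol described above.
public class ScoreMapCache {

    /** Thrown so a search fails with a recognisable message when the user's scores are
     *  not cached; the client catches it and resends the request with the full map. */
    public static class ScoreMapMissingException extends RuntimeException {
        public ScoreMapMissingException(int userId) {
            super("score_map_missing:" + userId);
        }
    }

    private final ConcurrentHashMap<Integer, Map<Integer, Float>> cache =
            new ConcurrentHashMap<Integer, Map<Integer, Float>>();

    /** Called once per search: use the map sent in params if present, otherwise the cache. */
    public Map<Integer, Float> mapFor(int userId, Map<Integer, Float> sentWithRequest) {
        if (sentWithRequest != null) {
            cache.put(userId, sentWithRequest);   // (re)initialise the cache for this user
            return sentWithRequest;
        }
        Map<Integer, Float> cached = cache.get(userId);
        if (cached == null) {
            throw new ScoreMapMissingException(userId);
        }
        return cached;
    }

    /** Per-document lookup, the equivalent of params['score_map'][doc['id']] in the MVEL version. */
    public static float scoreFor(Map<Integer, Float> scores, int docId) {
        Float score = scores.get(docId);
        return score == null ? 0.0f : score.floatValue();
    }
}

The search client would catch the specific "score_map_missing" failure and retry once with the full map in the request; subsequent requests then hit the cached copy.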
I've been going down that route and it is a lot, lot faster than passing the parameters each time - it hadn't occurred to me that so much of the elapsed time was spent handling the parameters. With a warm cache (which should be the common case - users very rarely view just one page) I'm getting results in around 15ms, which is more than fast enough (and way faster and more flexible than our previous home-grown, non-Elasticsearch solution).
Thanks.
Fred
Can you share the code you wrote so other users can benefit from it? It would be a great sample.
The production code has some extra knobs to do with testing whether the cached data is stale with respect to what's in mongo, and a few other bits and bobs, but I believe the gist demonstrates the general workings. Apologies again for my rusty Java!
Fred
Ahh, you query mongo right from the script if you need to populate the cache, cool! One small optimization option: Elasticsearch has Trove embedded in it (primitive-type data structures), so you can use it for the map from int (id) -> BasicDBObject. Also, I would suggest just using a collection that does int->int and not storing the mongo DBObject (extra memory).
Another note: you use a LinkedHashMap to do the caching. Note that it's not thread-safe by default. You can wrap it with a synchronized map, but you could also use Google Guava's CacheBuilder (also embedded in Elasticsearch) to build a cache that evicts based on the number of entries and is still concurrent.
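A rough sketch of both suggestions combined - Guava's CacheBuilder for a concurrent, size-bounded cache and a Trove int->int map for the scores. The Trove package name below is from recent Trove releases and may differ from the copy bundled with Elasticsearch; the loader interface is made up to stand in for the MongoDB query:

import java.util.concurrent.ExecutionException;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import gnu.trove.map.hash.TIntIntHashMap;

public class UserScoreCache {

    /** Hypothetical hook for however the scores are fetched (a single MongoDB query in this thread). */
    public interface ScoreLoader {
        TIntIntHashMap loadScores(int userId);
    }

    private final LoadingCache<Integer, TIntIntHashMap> cache;

    public UserScoreCache(final ScoreLoader loader) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(500)   // evict least-recently-used user maps beyond this many entries
                .build(new CacheLoader<Integer, TIntIntHashMap>() {
                    @Override
                    public TIntIntHashMap load(Integer userId) {
                        return loader.loadScores(userId);
                    }
                });
    }

    public int scoreFor(int userId, int docId) {
        try {
            TIntIntHashMap scores = cache.get(userId);
            return scores.containsKey(docId) ? scores.get(docId) : 0;
        } catch (ExecutionException e) {
            throw new RuntimeException("failed to load scores for user " + userId, e);
        }
    }
}

A LoadingCache also means concurrent lookups for the same user's map block on a single load rather than each hitting MongoDB separately.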
Thanks for the tips. Other than my ignorance of more concurrency-friendly cache structures, the reason for putting a synchronized(this){} around fetching/populating the cache is that otherwise mongo was getting hit once for each shard, as they all rushed to populate the cache.
Fred
Implemented both of your tips this morning. Changing to an int->int map has probably shaved off a few percent, and the Google Guava cache is way nicer than my botch job and handles all my needs (including refreshes - I do optimistic refreshes by querying mongo for recommendation_score_maps.find(profile_id: 1234, updated_at: {$gt: last_updated_at}) - if I get no records back then I know my cached data is still fresh and I don't need to rebuild the int->int map).
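(A sketch of that optimistic freshness check with the classic mongo-java-driver API - the collection and field names follow the post; everything else is illustrative:)

import java.util.Date;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

public class ScoreMapFreshness {

    /** Returns true if the cached map for this profile is still fresh, i.e. no newer
     *  score document has been written to MongoDB since the map was last loaded. */
    public static boolean isFresh(DBCollection scoreMaps, int profileId, Date lastUpdatedAt) {
        DBObject query = new BasicDBObject("profile_id", profileId)
                .append("updated_at", new BasicDBObject("$gt", lastUpdatedAt));
        return scoreMaps.findOne(query) == null;   // no newer record means the cache is still good
    }
}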
Thanks for the tips!
Fred
I have a setup where I require a per-user ordering much like Frederick describes. We have a score table in MySQL, with a score per document per user, and we would like to sort search results based on the specific user's scores.
As an alternative to the third option Frederick describes (nested documents), we've considered adding a field per user to each document, named "_score_{user_id}", e.g. "_score_5327" for the user with id 5327. Querying Elasticsearch on behalf of that user then requires specifying "sort": { "_score_5327": { "order": "desc" } }. By keeping the per-user score on the root document, we sidestep the problem of not being able to sort on nested document fields. We can keep the scores up to date with the partial update API.
Are there any apparent problems with this approach? Can ES handle the many
extra fields (~3,000 scores for the most popular document)? How does it
compare to writing a plugin, like suggested above?
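For what it's worth, roughly what the query side of that idea could look like with the Java client of the time - the index and field names are made up for the example, and it assumes the per-user _score_{user_id} fields are already being written to each document (e.g. via the update API):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;

public class PerUserFieldSearch {

    /** Filter on category_id and sort on the caller's per-user score field. */
    public static SearchResponse searchForUser(Client client, long userId, long categoryId) {
        String scoreField = "_score_" + userId;   // per-user field naming from the post
        return client.prepareSearch("items")
                .setQuery(QueryBuilders.filteredQuery(
                        QueryBuilders.matchAllQuery(),
                        FilterBuilders.termFilter("category_id", categoryId)))
                .addSort(SortBuilders.fieldSort(scoreField).order(SortOrder.DESC))
                .setSize(20)
                .execute()
                .actionGet();
    }
}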
Thanks,
Andreas