Hi Peter,
The rescorer accepts any query, so you can use all the machinery out there,
including function score, if you put your function score query under the
rescorer. It always does a weighted sum of the query score and the rescorer
score, so you can tweak things to your liking. Set query_weight to 0 if you
want only the rescorer score.
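As an illustration, a rescore request body along these lines keeps only the rescorer score (a sketch; the match query and the native script name "jaccard_score" are made up):

```python
# Sketch of a rescore request body that ignores the first-phase query score.
# The field names and the native script name are placeholders.
rescore_body = {
    "query": {"match": {"title": "example"}},  # cheap first-phase query
    "rescore": {
        "window_size": 100,  # rescore only the top 100 hits per shard
        "query": {
            "rescore_query": {
                "function_score": {
                    "script_score": {"script": "jaccard_score", "lang": "native"}
                }
            },
            "query_weight": 0,         # drop the first-phase score entirely
            "rescore_query_weight": 1  # keep only the rescorer's score
        }
    }
}
```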
The rescorer runs on every shard before the results return so no need to
include your scripts on the client nodes.
I think I understand where you're going.
Here is a trick that may speed things up:
Assuming the scores are always between 0 and 1, you can store the feature
index and the score together in one number: 1.20 would mean that index 1 has
score 0.20, and 20.44 means that index 20 has score 0.44. This has the upside
that you can use field data to load these values into memory and access them
as doubles. They also sort in the right order. To save memory, you can go
further: say you only need 2-byte (or 1-byte) accuracy for the scores. Then
you can store 3-byte numbers where the highest-order byte is the index
and the 2 least significant bytes are the score.
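A minimal sketch of both packing variants in Python (assuming scores in [0, 1) with two-decimal precision; helper names are made up):

```python
def encode(index, score):
    """Pack a feature index (integer part) and a score in [0, 1)
    (two-decimal fractional part) into a single double."""
    assert 0.0 <= score < 1.0
    return index + round(score, 2)

def decode(value):
    """Recover (index, score) from a packed double."""
    index = int(value)
    return index, round(value - index, 2)

# 1.20 means index 1 has score 0.20; 20.44 means index 20 has score 0.44.
assert decode(1.20) == (1, 0.20)
assert decode(20.44) == (20, 0.44)

def encode_packed(index, score):
    """Tighter variant: one high-order byte for the index,
    two least-significant bytes for the score."""
    return (index << 16) | int(round(score * 0xFFFF))

def decode_packed(packed):
    """Recover (index, approximate score) from the 3-byte packing."""
    return packed >> 16, (packed & 0xFFFF) / 0xFFFF
```

Both encodings sort index-major then score-major, which is what makes them usable directly from field data.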
Cheers,
Boaz
On Tue, Oct 29, 2013 at 5:52 PM, Peter Pathirana peter@vagaband.co wrote:
Hi Boaz,
Not quite sure we can get away with the rescorer alone. Not sure
whether it allows multiple scores to be aggregated, function-score
style. Also, rescoring sounds like it makes using the node client tricky, as
it might require our scoring plugin to be deployed to client nodes running
the node client.
Let me tell you about our use case from a higher level. We have machine
learning processes that generate "matches" for our users or cohorts of
users. We put these lists of matches, per user/per cohort, into ES. We use ES
as the part of our infrastructure that serves these matches to our users in
real time. As part of serving matches, it allows additional filtering
(maybe based on UI interaction), sorting (i.e., scoring by additional
measures of relevance with regard to the "match"), etc.
So in this particular case, we're running an additional scoring algorithm
on vectors (fields x, y, z in the index mappings I gave) to personalize the
results to a given user. The user's particular values for these vectors are
given via the query and may be behavior-driven.
We are using a Jaccard-esque formula for determining the distance between
the values of these vectors in the documents and those of the user.
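For reference, plain Jaccard similarity between two sets is |A ∩ B| / |A ∪ B|; a minimal sketch (our actual formula is only Jaccard-esque and may differ):

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention for two empty sets
    return len(a & b) / len(a | b)

# |{2, 3}| / |{1, 2, 3, 4}| = 2 / 4
assert jaccard([1, 2, 3], [2, 3, 4]) == 0.5
```

A distance can then be taken as 1 minus this similarity.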
That might sound a bit confusing, if it does.. I can explain further.
Thanks,
Peter
On Oct 29, 2013, at 7:45 AM, Boaz Leskes b.leskes@gmail.com wrote:
Hi Peter,
Nice!
I have some ideas on how you could speed things up by using nested
documents, loading those values into memory, and writing your own custom
score function (and a plugin), but that would be quite a bit of work.
As an alternative, you might want to consider the query rescorer. It allows
you to first quickly get the top N results based on a lighter, approximate
scoring metric and then apply the more complex one (your script) only to
those top N.
Out of curiosity, what are you planning to use the Jaccard score for? What
is the use case?
Cheers,
Boaz
On Thu, Oct 24, 2013 at 4:53 PM, peter@vagaband.co wrote:
Hey Boaz,
Sorry for the delay in getting back.. was out of town.
So right now, I'm storing the keys and values in two separate fields as
comma-delimited strings, and splitting them apart within the plugin. But
splitting them out for every single doc during scoring is not very
performant.
Here's a gist with 3 files, current version of plugin, current index
mappings, and function score query I'm running on it.
JaccardScoreScript · GitHub
If you can suggest a better (more performant) way of either modeling the
data or writing this scoring logic, I'd be a very happy camper.
Thank you,
Peter
On Monday, October 21, 2013 10:28:08 AM UTC-4, Boaz Leskes wrote:
Hi Peter,
doc().get("field") uses the field data cache discussed before.
fields().get("field") uses Lucene stored fields, which are on disk and thus
cached by the file system cache (and are typically too slow for scoring).
It will sadly not support nested objects, as it works at the Lucene document
level (and nested docs are separate Lucene docs).
As far as I can tell, the only way to get at the nested structures in a
script right now is using the SourceLookup, which is slow. I have some ideas
about how we could potentially extend it, but that needs some more thinking
and time.
I was hoping you could do whatever you need with nested queries...
If that doesn't work, perhaps you can give some examples of what you
need (JSON + the needed score) and I'll try to come up with something else.
Cheers,
Boaz
On Mon, Oct 21, 2013 at 3:10 PM, pe...@vagaband.co wrote:
Thanks, Boaz. That makes sense now. Nested objects seem like a
solution, but I'm not quite sure how I might access nested object
values from within a script scoring plugin.
There seem to be two options:
- doc().get("field")
- fields().get("field")
Both seem to use some form of cache, but #1 only seems to support
Longs, Doubles, and Strings. #2 looks like it will support complex objects
(like the one you mentioned: [{"key": "k1", "value": "v1"}, {"key": "k2",
"value": "v2"}]). So it looks like #2 is the only option here.
What's the difference between the two? #2 seems to be using
a SingleFieldsVisitor to access values, while #1 uses
an IndexFieldDataService. It looks like both have some form of cache, but #1
seems to have a proper field cache underneath the top-level cache while #2
doesn't. So it looks like #2 is not going to perform that well. Am I
looking at it wrong?
Thanks again for your help.
Peter
On Monday, October 21, 2013 7:41:42 AM UTC-4, Boaz Leskes wrote:
Hi Peter,
The docFieldDoubles method gets its values from the in-memory
structures of the field data cache. This is done for performance. The field
data cache is not loaded from the source of the document (because that would
be slow) but from the Lucene index, where the values are sorted (for lookup
speed). The get API does work based on the original document source, which
is why you see those values in order (note: ES doesn't parse the source
for the get API, it just gives you back what you put in).
You can access the original document (which will be parsed) using the
SourceLookup (available from the source method), but it will be slow, as it
needs to go to disk for every document.
I'm not sure about the exact semantics of what you are trying to
achieve, but did you try looking at nested objects? Those allow you to
store a list of objects in a way that keeps values together, like [{ "key":
"k1" , "value" : "v1"},...] .
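For example, a sketch of such a nested mapping and a document using it (field names are made up):

```python
# Hypothetical mapping: a "features" nested field keeps each key/value
# pair together as its own hidden Lucene document, so pairs don't get
# flattened and mixed up the way plain object arrays do.
mapping = {
    "properties": {
        "features": {
            "type": "nested",
            "properties": {
                "key":   {"type": "string", "index": "not_analyzed"},
                "value": {"type": "string", "index": "not_analyzed"},
            },
        }
    }
}

# A document using that mapping:
doc = {"features": [{"key": "k1", "value": "v1"},
                    {"key": "k2", "value": "v2"}]}
```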
Cheers,
Boaz
On Saturday, October 19, 2013 5:08:05 PM UTC+2, pe...@vagaband.co wrote:
I'm storing some data in an array-type field which needs to be accessed
within a native script used as a custom scorer with a function_score
query. But when I access the field values within the native script using
docFieldDoubles, I do not get the values in order. Does the array data type
not maintain ordering? When I do a GET on that doc, it does show the values
in that field in order, but not from within the native script plugin. Is
this a bug, or is it expected?
What I'm really trying to do is this: I need to maintain a map, or a
set of key/value pairs, where the keys are different for each document. And
I need to access the key/value pairs using a known field name (from both
the scoring plugin as well as from search clients). Right now, I'm storing
two fields, one with the keys and the other with the values, both in
comma-delimited form. Then, from within the plugin, I split on commas and
figure out which key maps to which value based on position.
This is of course not very performant and I'd prefer to avoid doing that.
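Concretely, the current scheme amounts to something like this for every scored document (a sketch; the field contents are made up):

```python
# Two parallel comma-delimited string fields, rebuilt into a map.
# In the real plugin this split-and-zip happens once per scored document,
# which is exactly the cost we'd like to avoid.
keys_field = "color,size,brand"    # hypothetical keys field value
values_field = "red,large,acme"    # hypothetical values field value

feature_map = dict(zip(keys_field.split(","), values_field.split(",")))
assert feature_map["size"] == "large"
```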
As a first step, I tried arrays as mentioned above (instead of the
comma-delimited strings), but that seems to lose ordering. What's the best
way to do this?
Thanks,
Peter
--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/cI5im_EYIDY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to
elasticsearc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.