Accessing array field within Native Plugin


(peter@vagaband.co) #1

I'm storing some data in an array-type field that needs to be accessed within
a native script, which is used as a custom scorer with a function_score query.
But when I access the field values within the native script using
docFieldDoubles, I do not get the values in order. Does the array data type
not maintain ordering? When I do a GET on that doc, it shows the values in
that field in order, but not from within the native script plugin. Is this a
bug or is it expected?

What I'm really trying to do is this: I need to maintain a map, or a set of
key/value pairs, where the keys are different for each document. And I need
to access the key/value pairs using a known field name (from both the
scoring plugin and search clients). Right now, I'm storing two fields, one
with keys and the other with values, both in comma-delimited form. Then,
from within the plugin, I split on the comma and use position to figure out
which key maps to which value. This is of course not very performant and I'd
prefer to avoid doing it. As a first step, I tried arrays as mentioned above
(instead of comma-delimited strings), but that seems to lose ordering.
What's the best way to do this?
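
For illustration, the split-and-pair approach described above might look
roughly like the following. This is only a sketch: the class and method names
are hypothetical, the values are assumed to be numeric, and it is not the
actual plugin code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DelimitedPairs {
    // Pairs the i-th key with the i-th value from two comma-delimited strings.
    // Assumption: values are numeric and both strings have the same number of items.
    static Map<String, Double> parse(String keys, String values) {
        String[] k = keys.split(",");
        String[] v = values.split(",");
        if (k.length != v.length) {
            throw new IllegalArgumentException("keys and values differ in length");
        }
        // LinkedHashMap keeps the pairs in the order they were written.
        Map<String, Double> out = new LinkedHashMap<>();
        for (int i = 0; i < k.length; i++) {
            out.put(k[i].trim(), Double.parseDouble(v[i].trim()));
        }
        return out;
    }
}
```

Doing this string splitting and number parsing for every document visited
during scoring is the cost in question.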

Thanks,
Peter

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Boaz Leskes) #2

Hi Peter,

The docFieldDoubles method gets its values from the in-memory structures
of the field data cache. This is done for performance. The field data cache
is not loaded from the source of the document (because that would be slow)
but from the Lucene index, where the values are sorted (for lookup speed).
The get API works from the original document source, which is why you
see those values in order (note: ES doesn't parse the source for the
get API, it just gives you back what you put in).
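
To make the consequence concrete, here is a small sketch in plain Java (no
Elasticsearch classes involved). The class and method names are made up for
illustration, and the sort is only a stand-in for the sorted values a script
sees; the real field data structures are more involved.

```java
import java.util.Arrays;

class SortedValuesDemo {
    // Returns a sorted copy of the values, standing in for what a script
    // would see when reading a multi-valued field from field data.
    static double[] asSeenByScript(double[] sourceOrder) {
        double[] copy = Arrays.copyOf(sourceOrder, sourceOrder.length);
        Arrays.sort(copy);
        return copy;
    }
}
```

With source values {0.7, 0.1, 0.4}, asSeenByScript returns {0.1, 0.4, 0.7},
so position i in the values no longer lines up with position i in a separate
array of keys.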

You can access the original document (which will be parsed) using the
SourceLookup (available from the source method) but it will be slow as it
needs to go to disk for every document.

I'm not sure about the exact semantics of what you are trying to achieve,
but did you try looking at nested objects? Those allow you to store a list
of objects in a way that keeps values together, like [{ "key": "k1" ,
"value" : "v1"},...] .

Cheers,
Boaz


(peter@vagaband.co) #3

Thanks, Boaz. That makes sense now. Nested objects seem like a solution,
but I'm not quite sure how I might access nested object values from
within a script scoring plugin.

There seem to be two options:

  1. doc().get("field")
  2. fields().get("field")

Both seem to use some form of cache, but #1 only seems to support Longs,
Doubles and Strings. #2 looks like it will support complex objects (like
the one you mentioned - [{"key": "k1", "value": "v1"},{"key": "k2",
"value": "v2"}] ). So it looks like #2 is the only option here.

What's the difference between the two? #2 seems to be using
a SingleFieldsVisitor to access values while #1 uses
an IndexFieldDataService. It looks like both have some form of cache, but #1
seems to have a proper field cache underneath the top-level cache while #2
doesn't. So it looks like #2 is not going to perform that well. Am I
looking at it wrong?

Thanks again for your help.
Peter


(Boaz Leskes) #4

Hi Peter,

doc().get("field") uses the field data cache discussed before.
fields().get("field") uses Lucene stored fields, which are on disk and thus
cached only by the file system cache (and are typically too slow for
scoring). It will sadly not support nested objects, as it works at the
Lucene document level (and nested docs are separate Lucene docs).

As far as I can tell, the only way to get to the nested structures in a
script right now is using the SourceLookup, which is slow. I have some ideas
about how we could potentially extend it, but that needs some more thinking
and time.

I was hoping you can do whatever you need with nested queries...

If that doesn't work, perhaps you can give some examples of what you need
(JSON + needed score) and I'll try to come up with something else.

Cheers,
Boaz


(peter@vagaband.co) #5

Hey Boaz,

Sorry for the delay in getting back.. was out of town.

So right now, I'm storing the keys and values in two separate fields as
strings and delimiting them with commas within the string. And within the
plugin, splitting them out. But splitting them out for every single doc
during scoring is not very performant.

Here's a gist with 3 files: the current version of the plugin, the current
index mappings, and the function score query I'm running on it.
https://gist.github.com/ppat/7138638

If you can suggest a better (more performant) way of either modeling the
data or writing this scoring logic, I'd be a very happy camper.

Thank you,
Peter


(Boaz Leskes) #6

Hi Peter,

Nice!

I have some ideas on how you could speed things up by using nested
documents, loading those values into memory and writing your own custom
score function (and a plugin), but that will be quite a bit of work.

As an alternative you might want to consider the query rescorer (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-rescore.html#_query_rescorer).
The query rescorer allows you to first quickly get the top N results
based on a lighter, approximate scoring metric and then apply the more
complex one (your script) only to those top N.

Out of curiosity, how are you planning to use the Jaccard score? What is
the use case?

Cheers,
Boaz


(peter@vagaband.co) #7

Hi Boaz,

I'm not quite sure we can get away with the rescorer alone. I'm not sure whether it allows multiple scores to be aggregated, function-score style. Also, rescoring sounds like it makes using the node client tricky, as that might require our scoring plugin to be deployed to the client nodes running the node client.

Let me tell you about our use case from a higher level. We have machine learning processes that generate "matches" for our users or cohorts of users. We put these lists of matches per user/per cohort into ES. We use ES as the part of our infrastructure that serves these matches to our users in real time. As part of serving matches, it will allow additional filtering (maybe based on UI interaction) or sorting (i.e. scoring by additional measures of relevance with regard to the "match"), etc.

So in this particular case, we're running an additional scoring algorithm on vectors (fields x, y, z in the index mappings I gave) to personalize the results for a given user. The user's particular values for these vectors are given via the query and may be behavior-driven.

We are using a Jaccard-esque format for determining the distance of values for these vectors between documents and the user.
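
For readers unfamiliar with the measure, plain Jaccard similarity between two sets is the size of their intersection divided by the size of their union. The sketch below shows only that textbook form over sets of keys. The class and method names are hypothetical, and the "Jaccard-esque" variant actually used in the plugin may well differ (for example by weighting values).

```java
import java.util.HashSet;
import java.util.Set;

class Jaccard {
    // Textbook Jaccard similarity: size of intersection / size of union.
    // Returns 0.0 when both sets are empty (a convention chosen for this sketch).
    static double similarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

The corresponding distance is one minus this value.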

That might sound a bit confusing. If it does, I can explain further.

Thanks,
Peter


(Boaz Leskes) #8

Hi Peter,

The rescorer accepts any query, so you can use all the machinery out there,
including function score: just put your function score query under the
rescorer. It always does a weighted sum of the query score and the rescorer
score, so you can tweak things to your liking. Set query_weight to 0 if you
want only the rescorer score.

The rescorer runs on every shard before the results return so no need to
include your scripts on the client nodes.

I think I understand where you're going.

Here is a trick that may speed things up:

Assuming the scores are always between 0 and 1 (strictly below 1), you can
store the feature index and score together: 1.20 would mean that index 1 has
score 0.20, and 20.44 means that index 20 has score 0.44. This has the upside
that you can use the field data to load this into memory and access it via
the double values, and they sort in the right order. To save memory, you can
go further: say you only want scores to have 2-byte (or 1-byte) accuracy.
Then you can always store 3-byte numbers where the highest-order byte is the
index and the 2 least significant bytes are the score.
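
A rough sketch of both encodings in plain Java, in case it helps. The class and method names are made up for illustration and nothing here touches the Elasticsearch API. It assumes a non-negative index, and for the 3-byte form an index that fits in one byte (0 to 255) with the score quantized to 16 bits.

```java
final class PackedFeature {
    // Decimal form: integer part is the feature index, fractional part is
    // the score. The score must stay below 1 (and not so close to 1 that
    // the addition rounds up), or it would spill into the next index.
    static double pack(int index, double score) {
        if (index < 0 || score < 0.0 || score >= 1.0) {
            throw new IllegalArgumentException("need index >= 0 and 0 <= score < 1");
        }
        return index + score;
    }

    static int index(double packed) {
        return (int) Math.floor(packed);
    }

    static double score(double packed) {
        return packed - Math.floor(packed);
    }

    // 3-byte form: highest-order byte is the index, the two low bytes hold
    // the score scaled to 0..65535. Here a score of exactly 1.0 is fine.
    static int packBytes(int index, double score) {
        if (index < 0 || index > 0xFF || score < 0.0 || score > 1.0) {
            throw new IllegalArgumentException("need 0 <= index <= 255 and 0 <= score <= 1");
        }
        int quantized = (int) Math.round(score * 0xFFFF);
        return (index << 16) | quantized;
    }

    static int indexOfBytes(int packed) {
        return (packed >>> 16) & 0xFF;
    }

    static double scoreOfBytes(int packed) {
        return (packed & 0xFFFF) / (double) 0xFFFF;
    }

    public static void main(String[] args) {
        double d = pack(20, 0.44);
        System.out.println(index(d) + " -> " + score(d));

        int b = packBytes(1, 0.2);
        System.out.println(indexOfBytes(b) + " -> " + scoreOfBytes(b));
    }
}
```

Because the index sits in the most significant part in both forms, sorted values come back grouped by index, so the sorting done by field data no longer breaks the key/value pairing. The decoded score is only approximate: floating-point error in the decimal form, quantization error in the byte form.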

Cheers,
Boaz

On Tue, Oct 29, 2013 at 5:52 PM, Peter Pathirana peter@vagaband.co wrote:

Hi Boaz,

Not quite sure if we can get away with the rescorer alone. Not sure
whether it allows multiple scores to be aggregated à la function score.
Also, rescoring sounds like it makes using the node client tricky, as that
might require our score plugin to be deployed to client nodes running the
node client.

Let me tell you about our use case from a higher level. We have machine
learning processes that generate "matches" for our users or cohorts of
users. We put these lists of matches per user/per cohort into ES. We use ES
as the part of our infrastructure that serves these matches to our users in
real time. As part of serving matches, it will allow additional filtering
(maybe based on UI interaction) or sorting (i.e. scoring by additional
measures of relevance with regard to the "match"), etc.

So in this particular case, we're running an additional scoring algorithm
on vectors (fields x,y,z in the index mappings I gave) to personalize the
results to a given user. The user's particular values for these vectors are
given via the query and may be behavior driven.

We are using a Jaccard-esque format for determining the distance of values
for these vectors between documents and the user.
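
For readers unfamiliar with the idea, one common "Jaccard-esque" measure over key/value vectors is the weighted Jaccard similarity: the sum of per-key minimums divided by the sum of per-key maximums. The sketch below is only an illustration of that general formula in plain Java. It is an assumption, not the scoring code from the gist, and it assumes non-negative weights with missing keys treated as 0.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class WeightedJaccard {
    // Weighted Jaccard similarity between two sparse vectors given as
    // key -> weight maps. Returns a value in [0, 1]; 1 means identical.
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());

        double minSum = 0.0;
        double maxSum = 0.0;
        for (String key : keys) {
            double x = a.getOrDefault(key, 0.0);
            double y = b.getOrDefault(key, 0.0);
            minSum += Math.min(x, y);
            maxSum += Math.max(x, y);
        }
        // Two empty (or all-zero) vectors have nothing in common to compare.
        return maxSum == 0.0 ? 0.0 : minSum / maxSum;
    }

    public static void main(String[] args) {
        Map<String, Double> doc = Map.of("a", 1.0, "b", 0.5);
        Map<String, Double> user = Map.of("b", 0.5, "c", 1.0);
        System.out.println(similarity(doc, user));
    }
}
```

A distance, if that is what is needed, would be 1 minus this similarity.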

That might sound a bit confusing; if it does, I can explain further.

Thanks,
Peter

On Oct 29, 2013, at 7:45 AM, Boaz Leskes b.leskes@gmail.com wrote:

Hi Peter,

Nice!

I have some ideas on how you could speed things up by using nested
documents, loading those values into memory and writing your own custom
score function (and a plugin), but that will be quite a bit of work.

As an alternative you might want to consider the query rescorer (
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-rescore.html#_query_rescorer). The query rescorer allows you to first quickly get the top N results
based on a lighter approximate scoring metric and then only apply the more
complex one (your script) to those top N.

Out of curiosity, what are you planning to use the Jaccard score for? What
is the use case?

Cheers,
Boaz

On Thu, Oct 24, 2013 at 4:53 PM, peter@vagaband.co wrote:

Hey Boaz,

Sorry for the delay in getting back.. was out of town.

So right now, I'm storing the keys and values in two separate fields as
strings and delimiting them with commas within the string. And within the
plugin, splitting them out. But splitting them out for every single doc
during scoring is not very performant.

Here's a gist with 3 files, current version of plugin, current index
mappings, and function score query I'm running on it.
https://gist.github.com/ppat/7138638

If you can suggest a better (more performant) way of either modeling the
data or writing this scoring logic, I'd be a very happy camper.

Thank you,
Peter

On Monday, October 21, 2013 10:28:08 AM UTC-4, Boaz Leskes wrote:

Hi Peter,

doc().get("field") uses the field data cache discussed before.
fields().get("field") uses Lucene stored fields, which are on disk and thus
cached by the file system cache (and are typically too slow for scoring).
It will sadly not support nested objects, as it works on the Lucene document
level (and nested docs are separate Lucene docs).

As far as I can tell, the only way to get to the nested structures in a
script right now is using the SourceLookup, which is slow. I have some ideas
about how we can potentially extend it, but that needs some more thinking
and time.

I was hoping you can do whatever you need with nested queries...

If that doesn't work, perhaps you can give some examples of what you
need (JSON + needed score) and I'll try to come up with something else.

Cheers,
Boaz

On Mon, Oct 21, 2013 at 3:10 PM, pe...@vagaband.co wrote:

Thanks, Boaz. That makes sense now. Nested objects seem like a
solution, but I'm not quite sure how I might access nested object
values from within a script scoring plugin.

There seem to be two options:

  1. doc().get("field")
  2. fields().get("field")

Both seem to use some form of cache, but #1 only seems to support
Longs, Doubles and Strings. #2 looks like it will support complex objects
(like the one you mentioned - [{"key": "k1", "value": "v1"},{"key": "k2",
"value": "v2"}] ). So it looks like #2 is the only option here.

What's the difference between the two? #2 seems to be using
a SingleFieldsVisitor to access values while #1 uses
an IndexFieldDataService. It looks like both have some form of cache, but #1
seems to have a proper field cache underneath the top level cache while #2
doesn't. So it looks like #2 is not going to perform that well. Am I
looking at it wrong?

Thanks again for your help.
Peter




(system) #9