How to calculate parent document relevance from child matches

I am trying to implement a custom relevance algorithm with ElasticSearch
using a parent/child document relationship. I understand that classic
relational database joins are not possible, but that there is some limited
join functionality using parent/child or nested documents. I would like to
calculate a relevance score for the parent document from its matching child
documents using a custom script or similar. Simplified, the mapping types
are defined as follows:

{
  "my_index": {
    "my_item": {
      "properties": {
        "url": {"type": "string", "index": "not_analyzed"}
      }
    },
    "relevance": {
      "_parent": {"type": "my_item"},
      "properties": {
        "search_term": {"type": "string", "index": "not_analyzed"},
        "score_data": {"type": "object", "index": "no"}
      }
    }
  }
}

Each item has a number of pre-computed relevance entities that increase its
relevance when they match one or more of the query terms. An example query
could be:

{
  "query": {
    "has_child": {
      "type": "relevance",
      "query": {
        "terms": {
          "search_term": ["term_1", "term2", "term3"],
          "minimum_match": 1
        }
      }
    }
  }
}

I would like to sort the matching parent items according to a custom
relevance formula that uses the score data in the matching child set of
every parent found.

Alternatively the children could be searched, e.g.

{
  "fields": ["_parent", "search_term", "score_data"],
  "query": {
    "terms": {
      "search_term": ["term_1", "term2", "term3"],
      "minimum_match": 1
    }
  }
}

This would require that I can sort and return the distinct list of parents,
scored from the matching child set with my custom formula (script).
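For illustration only, a client-side version of this second option might look like the Python sketch below. It assumes each child hit arrives as a dict with a `fields` entry holding `_parent` and `score_data`, that `score_data` contains a numeric `weight` (a hypothetical key, the real structure is not shown in this thread), and that the formula is a plain sum. It also pulls every matching child to the client, which is exactly the cost a server-side solution would avoid.

```python
from collections import defaultdict


def rank_parents(child_hits, formula=sum):
    """Group child hits by parent id and score each parent with `formula`."""
    weights_by_parent = defaultdict(list)
    for hit in child_hits:
        fields = hit["fields"]
        # "weight" is a hypothetical key inside score_data.
        weights_by_parent[fields["_parent"]].append(fields["score_data"]["weight"])
    scored = [(parent, formula(weights)) for parent, weights in weights_by_parent.items()]
    # Highest score first; ties are broken by parent id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hand-written sample hits, shaped like the "fields" section of a search response.
hits = [
    {"fields": {"_parent": "a", "search_term": "term_1", "score_data": {"weight": 0.5}}},
    {"fields": {"_parent": "b", "search_term": "term_1", "score_data": {"weight": 2.0}}},
    {"fields": {"_parent": "a", "search_term": "term2", "score_data": {"weight": 1.0}}},
]
ranked = rank_parents(hits)
print(ranked)
```

With the sample data, parent `b` (2.0) ranks ahead of parent `a` (0.5 + 1.0 = 1.5). Any other aggregation can be passed as `formula`, for example `max`.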

It is unclear to me whether ElasticSearch keeps parents and children on the
same shard automatically, but that seems to be the most sensible choice;
otherwise I cannot compute the parent relevance correctly. I also have
ElasticSearch generate the identifiers when the data is indexed.

I would really appreciate the help if anyone has an idea of how to solve
this relevance problem efficiently with ElasticSearch. Cheers.

--

Hi Mario

On Thu, 2013-01-17 at 12:58 -0800, Mario wrote:

> I am trying to implement a custom relevance algorithm with
> ElasticSearch using a parent/child document relationship. I understand
> that classic relational database joins are not possible, but that there
> is some limited join functionality using parent/child or nested
> documents. I would like to calculate a relevance score for the parent
> document from its matching child documents using a custom script or
> similar.

This can be achieved both with parent/child and with nested docs - the
choice between the two depends upon other factors such as whether you
will frequently update the child docs independent of the parent.

> An example query could be:
>
> {
>   "query": {
>     "has_child": {
>       "type": "relevance",
>       "query": {
>         "terms": {
>           "search_term": ["term_1", "term2", "term3"],
>           "minimum_match": 1
>         }
>       }
>     }
>   }
> }

The has_child query is really just a filter - it returns parent_ids that
have children that match the filter.

For scoring, you should use the top_children query instead.
http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html
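
As a rough sketch of what that could look like for the mapping above, the request body can be built in Python as below. The `score`, `factor` and `incremental_factor` values are illustrative assumptions rather than recommendations, and `top_children_query` is just a hypothetical helper name.

```python
import json


def top_children_query(terms, score="sum", factor=5, incremental_factor=2):
    """Build a top_children request body for the 'relevance' child type."""
    return {
        "query": {
            "top_children": {
                "type": "relevance",
                "query": {
                    "terms": {"search_term": list(terms), "minimum_match": 1}
                },
                # How the child scores are combined per parent: max, sum or avg.
                "score": score,
                # How many child hits to fetch relative to the requested page size.
                "factor": factor,
                "incremental_factor": incremental_factor,
            }
        }
    }


body = top_children_query(["term_1", "term2", "term3"])
print(json.dumps(body, indent=2))
```

The parent score here still comes from the Lucene scores of the child hits, so this on its own does not apply a custom formula.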

> It is unclear to me whether ElasticSearch keeps parents and children
> on the same shard automatically, but that seems to be the most
> sensible choice; otherwise I cannot compute the parent relevance
> correctly. I also have ElasticSearch generate the identifiers when
> the data is indexed.

Yes, parent and child are stored on the same shard. Which shard a
document is stored on depends on the _routing value. Normally, the
_routing value is calculated based on the specified or auto-generated
_id.

However with parent-child, the child uses the parent's _id value to
calculate the _routing. If you manually specify a different _routing
value for the parent doc, then you need to specify the same _routing
value for the child doc to ensure that they end up on the same shard.
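
The routing rule can be pictured with a small Python model. This is only a sketch of the idea: `zlib.crc32` stands in for the real hash function, which is an assumption made for illustration, and the helper names are hypothetical.

```python
import zlib


def shard_for(routing, number_of_shards):
    """Map a routing value to a shard number.

    zlib.crc32 is a stand-in hash, not the function ElasticSearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


def parent_shard(parent_id, number_of_shards, routing=None):
    # A parent is routed by its own _id unless a custom routing value is given.
    return shard_for(routing or parent_id, number_of_shards)


def child_shard(child_id, parent_id, number_of_shards, routing=None):
    # The child's own _id plays no part: the parent's _id (or the same
    # custom routing value as the parent) decides the shard.
    return shard_for(routing or parent_id, number_of_shards)


print(parent_shard("item-1", 5), child_shard("rel-17", "item-1", 5))
```

Both calls print the same shard number, because both hash the parent's `_id`. If the parent is indexed with a custom routing value and the child is not, the two hashes are computed from different inputs and the documents can land on different shards.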

clint

--

Hi Clinton,

Thanks for clarifying how parent/child relations are sharded and for the
interesting input on our relevance problem. This looks close to what I am
looking for, but one thing strikes me about the top_children functionality:
we cannot use the standard Lucene scoring. We want to apply our own scoring
script, which needs to use the scoring data stored in the matching children
to calculate the parent relevance. It doesn't look as if this is possible
with top_children, unless I misunderstand how the scoring works with this
type of query. Do you have any suggestions for how I can do this? We also
really need to recalculate all children when the parent is updated, so
nested documents are maybe a better choice. Would this give us additional
options to solve the relevance problem?

Thanks
Mario

On Friday, January 18, 2013 10:21:48 AM UTC+1, Clinton Gormley wrote:

The has_child query is really just a filter - it returns parent_ids that
have children that match the filter.

For scoring, you should use the top_children query instead.

http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html

--

On Fri, 2013-01-18 at 01:50 -0800, Mario wrote:

> Hi Clinton,
>
> Thanks for clarifying how parent/child relations are sharded and for
> the interesting input on our relevance problem. This looks close to
> what I am looking for, but one thing strikes me about the top_children
> functionality: we cannot use the standard Lucene scoring. We want to
> apply our own scoring script, which needs to use the scoring data
> stored in the matching children to calculate the parent relevance. It
> doesn't look as if this is possible with top_children, unless I
> misunderstand how the scoring works with this type of query. Do you
> have any suggestions for how I can do this? We also really need to
> recalculate all children when the parent is updated, so nested
> documents are maybe a better choice. Would this give us additional
> options to solve the relevance problem?

If you need to involve all children in the calculation, then yes, it
does sound like nested docs will be a better solution.

Preferably, if you can, perform the calculation before you index, and
store it in the root object (ie the top level document)
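
For reference, a nested variant of the mapping and query might be sketched as below, built as Python dicts. It assumes the relevance entries move into a `relevance` field of type `nested` on `my_item`, and that `score_mode` is set to `total`; both are illustrative choices, and the helper names are hypothetical.

```python
import json


def nested_mapping():
    """Mapping with the relevance entries stored inside each my_item document."""
    return {
        "my_item": {
            "properties": {
                "url": {"type": "string", "index": "not_analyzed"},
                "relevance": {
                    "type": "nested",
                    "properties": {
                        "search_term": {"type": "string", "index": "not_analyzed"},
                        "score_data": {"type": "object", "index": "no"},
                    },
                },
            }
        }
    }


def nested_query(terms, score_mode="total"):
    """Match items whose nested relevance entries contain any of the terms."""
    return {
        "query": {
            "nested": {
                "path": "relevance",
                # How the matching nested docs' scores are combined per item.
                "score_mode": score_mode,
                "query": {
                    "terms": {
                        "relevance.search_term": list(terms),
                        "minimum_match": 1,
                    }
                },
            }
        }
    }


print(json.dumps(nested_query(["term_1", "term2", "term3"]), indent=2))
```

The trade-off is the one mentioned earlier in the thread: nested entries live inside the parent document, so changing one entry means reindexing the whole item.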

clint

--

Unfortunately I cannot pre-calculate everything, since the score depends on
the query: only the children that match any of the terms in the query are
involved in the scoring calculation. The rest of the children are
irrelevant, and there are many of them, so I would like to avoid loading
all of them for the calculation.

On Friday, January 18, 2013 10:53:51 AM UTC+1, Clinton Gormley wrote:

If you need to involve all children in the calculation, then yes, it
does sound like nested docs will be a better solution.

Preferably, if you can, perform the calculation before you index, and
store it in the root object (ie the top level document)

--

If you can upgrade to 0.20.2, you can do something like this:
https://gist.github.com/4569086
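
The gist itself is not reproduced here. As a rough idea of the kind of query 0.20.2 makes possible, the Python sketch below builds a `has_child` query with a `score_type`, wrapping a `custom_score` query that scores each matching child by script. It assumes each child carries an indexed numeric field, called `weight` here, which is a hypothetical name that is not part of the mapping in this thread; the script and the helper name are likewise only illustrative.

```python
import json


def has_child_scored_query(terms, script="doc['weight'].value", score_type="sum"):
    """Score parents from script-scored matching children."""
    return {
        "query": {
            "has_child": {
                "type": "relevance",
                # How the child scores are combined into the parent score.
                "score_type": score_type,
                "query": {
                    "custom_score": {
                        "query": {
                            "terms": {
                                "search_term": list(terms),
                                "minimum_match": 1,
                            }
                        },
                        # Runs once per matching child; 'weight' is hypothetical.
                        "script": script,
                    }
                },
            }
        }
    }


body = has_child_scored_query(["term_1", "term2", "term3"])
print(json.dumps(body, indent=2))
```

Only the children that match the terms are scored, which fits the requirement that non-matching children stay out of the calculation.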

On Friday, January 18, 2013 5:00:05 AM UTC-5, Mario wrote:

Unfortunately I cannot pre-calculate everything, since the score depends on
the query: only the children that match any of the terms in the query are
involved in the scoring calculation. The rest of the children are
irrelevant, and there are many of them, so I would like to avoid loading
all of them for the calculation.

--

That is pretty awesome, Igor. This is exactly the kind of thing I was
looking for. I ran the test and it comes out with doc 1 on top, as
expected. That is deviously good :)

I will play around with it and see how it goes. Thank you very much.

On Friday, January 18, 2013 11:12:31 PM UTC+1, Igor Motov wrote:

If you can upgrade to 0.20.2 you can do something like this:
https://gist.github.com/4569086

--