Scoring a parent document search by a count of children matching part of the query?


(tbrianjones) #1

I have an index with parent documents (Companies) that have children
(Files). Each Company can have hundreds of Files. Companies and Files
both have many fields.

The search I'm trying to perform finds the Company that best matches based
on its own fields and the fields of its children (the Files). The current
query is a bool query with two should clauses: a has_child query on the
Files and a regular query on the Companies. I only require a minimum of
one match so, as I understand it, a Company that matches on its own fields
and on one of its children will score higher than a Company that only
matches on its own fields. You'll see I also have to apply a number of
filters to the Companies.

I'm wondering if there is a way to query the system so that it takes all
the children into account, and not just one. If ten Files match the query,
then that Company would likely score higher than a Company that only had a
few Files match ... obviously there would be other scoring going on ... so
maybe some sort of multiplier applied to the sum of the children's scores
would be appropriate. It's defining a query that matches multiple children
that I'm unable to figure out.

Here is an example of the query that I currently use:

{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "terms": {
              "_cache": true,
              "execution": "or",
              "locations.state": [
                "california",
                "maryland"
              ]
            }
          },
          {
            "terms": {
              "_cache": true,
              "execution": "and",
              "industries.term.not_analyzed": [
                "aerospace",
                "defense"
              ]
            }
          },
          {
            "geo_distance": {
              "locations.geolocation": {
                "lat": "41",
                "lon": "-82"
              },
              "distance": "25mi"
            }
          }
        ]
      },
      "query": {
        "bool": {
          "should": [
            {
              "query_string": {
                "default_field": "_all",
                "query": "adhesive"
              }
            },
            {
              "has_child": {
                "type": "file",
                "query": {
                  "query_string": {
                    "default_field": "_all",
                    "query": "adhesive"
                  }
                }
              }
            }
          ],
          "minimum_number_should_match": 1
        }
      }
    }
  }
}
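
One direction that may be worth trying, sketched here as an assumption rather than a tested answer: the has_child query can aggregate child scores into the parent's score instead of only recording that a child matched. In the Elasticsearch 1.x releases this is controlled by a score_mode option (spelled score_type in older releases) that accepts max, sum, avg or none, with none as the default. Setting it to sum on the has_child clause of the query above would let each matching File add to the Company's score, so a Company with ten matching Files should generally outrank one with two. Whether the option is available under this name depends on the version in use.

```json
{
  "has_child": {
    "type": "file",
    "score_mode": "sum",
    "query": {
      "query_string": {
        "default_field": "_all",
        "query": "adhesive"
      }
    }
  }
}
```

This clause would replace the existing has_child entry in the should array; the filters and the Company-level query_string stay as they are. A raw sum favors Companies with many Files, so avg or max may be a better fit if that bias turns out to be too strong.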

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/261146ca-c994-40d4-a970-6b5d872bb13e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(tbrianjones) #2

It seems like nesting the Files within the Company docs may be the only
solution here. That is definitely an option. I had indexed the Files as
children of Companies so that I could also query the Files on their own
(which I need to do as well), but I can maintain a separate index
altogether if need be.
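
For comparison, a rough sketch of what the nested variant of the File clause could look like, assuming the Files were indexed under a field mapped with "type": "nested". The field names files and files.body are hypothetical and stand in for whatever the real mapping uses. The nested query also takes a score_mode option; it defaults to avg, and sum is the mode that adds up the scores of all matching nested Files (some older releases spell it total).

```json
{
  "nested": {
    "path": "files",
    "score_mode": "sum",
    "query": {
      "query_string": {
        "default_field": "files.body",
        "query": "adhesive"
      }
    }
  }
}
```

Like the has_child clause, this would sit in the should array next to the Company-level query_string.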

(In reply to post #1, sent by Brian Jones on Friday, May 9, 2014 9:10:19 AM UTC-7.)



(tbrianjones) #3

Are there any gotchas I should be aware of when creating a document that
could contain thousands of pages of text ( a Company and thousands of
nested Files ) in addition to dozens/hundreds of fields?

(In reply to post #2, sent by Brian Jones on Friday, May 9, 2014 9:54:40 AM UTC-7.)


