Parent/Child use case


(Paul Smith) #1

Just wanted to check whether this scenario is a proper fit for the
parent/child mapping feature.

We currently index just the meta-data of documents (dozens of fields), but
we want to index file contents too, as that's sometimes useful for our
customers (in our use case, the meta-data is the primary mechanism). Since
we have hundreds of millions of document records and 100TB+ of files, it's a
non-trivial exercise we've managed to put off for a while.

Since any reindex requires indexing both meta-data and file content, which
for us are kept separately in the DB and fileserver respectively, I don't
want every meta-data update to also require a seek of the filestore to get
the text content for indexing, especially since the text content of a file
never changes (for us). I was hoping to find a way to keep meta-data and
text content separate in the index, so that meta-data updates can be applied
independently.

I was thinking of having a parent/child relationship between the
meta(parent) and the full text (child), allowing the parent to update
(frequent) leaving the child pretty much alone once text extracted and
indexed. Text extraction of newly uploaded files can be done async, and a
new child record added in ES independent of the registration of the document
meta data record.
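
A minimal sketch of what that split could look like, assuming the parent/child API of Elasticsearch at the time (a `_parent` entry in the child type's mapping and a `parent` parameter on the index request). The index name `documents`, the type names `doc` and `filecontent`, and the helper functions are hypothetical and only illustrate the idea; the code builds the requests without sending them.

```python
import json

# All names below are hypothetical, chosen for illustration only.
INDEX = "documents"
PARENT_TYPE = "doc"          # meta-data record, updated frequently
CHILD_TYPE = "filecontent"   # extracted text, indexed once


def child_mapping():
    """Mapping body declaring that CHILD_TYPE documents have a PARENT_TYPE parent."""
    return {CHILD_TYPE: {"_parent": {"type": PARENT_TYPE}}}


def index_meta_request(doc_id, meta):
    """Request that (re)indexes only the meta-data parent; no file access needed."""
    path = "/%s/%s/%s" % (INDEX, PARENT_TYPE, doc_id)
    return ("PUT", path, json.dumps(meta, sort_keys=True))


def index_text_request(doc_id, text):
    """Request that indexes the extracted text as a child of the meta-data record.

    Assumption: the child reuses the parent's id, and the `parent` parameter
    ties the child to its parent.
    """
    path = "/%s/%s/%s?parent=%s" % (INDEX, CHILD_TYPE, doc_id, doc_id)
    return ("PUT", path, json.dumps({"contents": text}))


if __name__ == "__main__":
    print(json.dumps(child_mapping()))
    print(index_meta_request("42", {"documentnumber": "ABC-123"}))
    print(index_text_request("42", "extracted text of the file"))
```

With this layout a meta-data update only needs `index_meta_request`, and the asynchronous text extraction can call `index_text_request` whenever it finishes.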

Does this sound like the right use case for parent/child in ES?

As I understand it, if we needed to reindex (say, new fields, or changed
values or something) then we'd also have to reindex the children, but we
could do these 2 reindex operations separately, mark the full text bit
'offline' until that's completed, allowing the meta-data to be searched much
earlier.

Shay, it would also help if the 'parent/child' feature that is referred to
so frequently were easy to find in the docs. I'm presuming it's the 'nested'
mapping type..? I see references to parent/child in the forums etc., and it
took me a while to bump into it in the docs when I went looking. I could be
blind though!

thanks,

Paul


(ppearcy) #2

Hey Paul,
First off, the parent/child support and the nested support are
actually two different features. The nested support seems to be the
more feature rich variant, though.

I was interested in both these features for storing a frequently
changing popularity score without needing to re-index the document.
Unfortunately, neither of these features fit my use case. Parent/child
can be indexed separately, but it isn't possible to join the child
document for the sorting of the parent. For nested, all data needs to
be re-indexed due to how the data is stored internally in ES.

We avoid having to hit our backend datastore for meta updates by
pulling the data from ES, updating it and resubmitting it. There is
also a new plugin that does exactly this. Either way, the whole document
needs to get re-indexed.

Don't take this as the definitive answer, though, I've only briefly
played around with both these features.

Best Regards,
Paul


(Paul Smith) #3

Thanks for the reply!

On 21 September 2011 08:20, ppearcy ppearcy@gmail.com wrote:

Hey Paul,
First off, the parent/child support and the nested support are
actually two different features. The nested support seems to be the
more feature rich variant, though.

scratches head So is there a web page on elasticsearch.com that details
the parent/child? I'm going blind, I could only find the 'nested' one
then.. ?

I was interested in both these features for storing a frequently
changing popularity score without needing to re-index the document.
Unfortunately, neither of these features fit my use case. Parent/child
can be indexed separately, but it isn't possible to join the child
document for the sorting of the parent. For nested, all data needs to
be re-indexed due to how the data is stored internally in ES.

Regarding the '... it isn't possible to join the child document for the
sorting of the parent' part: I don't need to sort by any child value in this
case, I just need to be able to match on text in the child sometimes and
return the parent as the hit.

So, if the parent ES document has a field called "documentnumber", and the
child is the text contents of the file attached to this parent and the child
has a field "contents", then if I search for:

documentnumber:ABC-123 OR contents:foo

then the results should return any parent whose documentnumber field
matches, PLUS any parents whose children have 'foo' in their contents.
I then sort by one of the parent's fields.

Would this work? I'm hoping to optionally allow the customer to search the
contents of the file, but return it 'inline' with other matches of the
parent meta-record.
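
One way the search above might be expressed, as a sketch only: a `bool` query whose `should` clauses are a match on the parent's `documentnumber` and a `has_child` query against the child's `contents` field, sorted by a parent field. The child type name `filecontent` and the helper function are hypothetical, and the `term` clause assumes `documentnumber` is indexed without analysis. Whether this combination behaves as hoped would need to be checked against a real cluster.

```python
import json


def parent_or_child_search(document_number, text, sort_field="documentnumber"):
    """Build a search body returning parents that match on their own field
    OR that have a child whose contents match `text`."""
    # Matches children of the (hypothetical) "filecontent" type and returns
    # their parents as the hits.
    child_clause = {
        "has_child": {
            "type": "filecontent",
            "query": {
                "query_string": {"default_field": "contents", "query": text}
            },
        }
    }
    # Assumes documentnumber is not analyzed, so an exact term match works.
    parent_clause = {"term": {"documentnumber": document_number}}
    return {
        "query": {"bool": {"should": [parent_clause, child_clause]}},
        # Sorting uses a field of the parent, since the hits are parents.
        "sort": [{sort_field: "asc"}],
    }


if __name__ == "__main__":
    print(json.dumps(parent_or_child_search("ABC-123", "foo"), indent=2))
```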

thanks,

Paul


(ppearcy) #4

Ah... I think that might work. Here is the best overall description
I can find on things:

And here are the relevant query types that can be run:
http://www.elasticsearch.org/guide/reference/query-dsl/top-children-query.html
-> I think you'd want this one since it scores
http://www.elasticsearch.org/guide/reference/query-dsl/has-child-query.html
http://www.elasticsearch.org/guide/reference/query-dsl/has-child-filter.html

Each one seems to query the children and return details on the parent.
Since you have a 1 to 1 mapping of parent to child, I think
top_children would work correctly.
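
For reference, a rough sketch of what a top_children query body might look like, based on my reading of the page linked above. The child type `filecontent`, the `contents` field and the helper function are hypothetical, and the `factor` and `incremental_factor` values are illustrative assumptions rather than tuned settings.

```python
import json

ALLOWED_SCORE_MODES = ("max", "sum", "avg")


def top_children_query(text, score="max", factor=5, incremental_factor=2):
    """Build a top_children query: run `text` against the children and
    aggregate the child scores into a score for each parent hit."""
    if score not in ALLOWED_SCORE_MODES:
        raise ValueError("score must be one of %s" % (ALLOWED_SCORE_MODES,))
    return {
        "top_children": {
            "type": "filecontent",  # hypothetical child type
            "query": {
                "query_string": {"default_field": "contents", "query": text}
            },
            "score": score,
            # Illustrative values: how many child hits to fetch relative to
            # the requested number of parent hits.
            "factor": factor,
            "incremental_factor": incremental_factor,
        }
    }


if __name__ == "__main__":
    print(json.dumps({"query": top_children_query("foo")}, indent=2))
```

With a one-to-one parent/child mapping, each parent has a single child score, so the choice between `max`, `sum` and `avg` should make no difference to the result.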

I'd be curious to know if this ends up working for you, since I've
always had issues wrapping my head around the primary use cases for
this feature :)

Best Regards,
Paul


(micuenta99) #5

Hi Paul,
I am also interested in this scenario (a 'doc' (parent) and a
'filecontent' (child)). How did you end up doing it?
Thanks,


(Paul Smith) #6

On 16 November 2011 03:31, micu99 micuenta99@gmail.com wrote:

Hi Paul,
I also am interested in this scenario ( 'doc' (parent) and a
'filecontent' (child)) At the end how you did it do?

I haven't gotten around to really seriously giving this a try. I had a
quick go with the Top Children but got stuck with a syntax error (my own
fault I'm certain) and then was involved in some other things, so haven't
gotten back to this one as yet sorry!


(micuenta99) #7

Thanks Paul,

Can anyone help?
What is the best way to implement a scenario like the one Paul
describes?

