Why do grandchildren need to specify the grandparents with the routing option?

(Yehosef) #1

I was reading through https://www.elastic.co/guide/en/elasticsearch/guide/current/grandparents.html and I don't understand why I should need to use it. Using the example given of country > branch > employee: if I already have a branch that exists, it would already be on the country shard. So when I add an employee with the parent branch, it would also be on the same shard. Why do I need routing?

The exception I could think of is if I'm adding the grandchild before the parent exists, so it can't use that data - but that seems like an odd flow.

Can someone shed light on this? Why is routing needed?

(Colin Goodheart-Smithe) #2

The reason for this has to do with how document routing is done (how we determine which shard a particular document lives on).

By default, the routing is done by taking the id of the document (as specified in the URL path when indexing), getting a hash of that value and then getting the modulo of the value wrt the number of shards (the remainder if we integer divide the hash value by the number of shards). So, for example if I create an index with 5 shards, and index a document with id of foo then we do the following:

shardNum = hash("foo") % 5

Which gives us a number between 0 and 4. Let's say in this case the shardNum is 3, so the document will be indexed on shard 3 of that index. The same applies when you GET a document. You send a request like the following:

curl -XGET "http://localhost:9200/my_index/my_type/foo"

From just this information we need to know which shard to go and get the document from so we use the same formula as above and hash "foo" and then get the modulo wrt the number of shards (5) which will again be 3 and we can go to the correct shard to get the document.
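
A minimal sketch of that formula in Python. The hash here is `zlib.crc32`, chosen only because it is stable across runs; it is a stand-in and not the hash function Elasticsearch uses internally, so the shard numbers it prints will not match a real cluster. `shard_for` is a hypothetical helper name.

```python
import zlib

def shard_for(routing_value: str, num_shards: int) -> int:
    """Map a routing value to a shard number in [0, num_shards)."""
    # Stand-in hash (assumption): any deterministic hash illustrates the idea.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

# Indexing and a later GET both hash the same id, so they agree on the shard.
shard_at_index_time = shard_for("foo", 5)
shard_at_get_time = shard_for("foo", 5)
print(shard_at_index_time, shard_at_get_time)
```

The point is only that the same input always yields the same shard number, which is why no lookup table is needed to find the document again.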

With parent-child documents it works slightly differently: when indexing a child document, instead of using the id as the routing value that we hash, we use the value of the parent query parameter. This means that both the parent and child end up on the same shard, because the result of the routing algorithm will be the same for both.

When you index a grandchild the parent query parameter points to the child document and we have no reference in the request that there is a parent above that child. Your request in that case might look like this:

curl -XPUT "http://localhost:9200/my_index/my_type/baz?parent=bar" -d '{}'

Now, because the request does not mention that the parent bar might have a parent of its own, Elasticsearch would use bar as the routing value in the routing algorithm. The document may then end up on a different shard from its parent (and grandparent), which had foo used as its routing value (I am assuming here that the document with id bar was indexed referencing a parent foo).

So we need to tell Elasticsearch that it should not use the parent as the routing value for the grandchild document but instead should use the id of the document at the top generation (in this case "foo"). We have to do this by setting a custom routing value:

curl -XPUT "http://localhost:9200/my_index/my_type/baz?parent=bar&routing=foo" -d '{}'
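
To make the difference concrete, here is a rough Python sketch of which value gets hashed for each request. `routing_value` and `shard_for` are hypothetical helpers, and `zlib.crc32` is only a stand-in for the real hash, so the actual shard numbers are illustrative.

```python
import zlib
from typing import Optional

NUM_SHARDS = 5

def routing_value(doc_id: str, parent: Optional[str] = None,
                  routing: Optional[str] = None) -> str:
    """Pick the value that gets hashed: explicit routing, else parent, else the id."""
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id

def shard_for(value: str) -> int:
    # Stand-in hash (assumption), not Elasticsearch's own.
    return zlib.crc32(value.encode("utf-8")) % NUM_SHARDS

top = shard_for(routing_value("foo"))                     # hashes "foo"
child = shard_for(routing_value("bar", parent="foo"))     # hashes "foo"
unrouted = shard_for(routing_value("baz", parent="bar"))  # hashes "bar"
routed = shard_for(routing_value("baz", parent="bar", routing="foo"))  # hashes "foo"

print(top, child, unrouted, routed)
```

`top`, `child` and `routed` are always equal because all three hash "foo". `unrouted` hashes "bar", so it only matches the others by coincidence.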

Does that make sense?

(Yehosef) #3

Thanks for the detailed explanation. I understand better now. You are determining the shard based on a hashing algorithm of the parent name - not the actual shard location of the parent.

Is the reason for this approach that it's too slow to figure out which shard the parent is on via a query? (This would all be internal at the Elasticsearch level - the client wouldn't need to do a specific query.) Even if this is slower, it might make sense to provide it as an option, because it's not clear that a child would immediately know its grandparent. In the example given, presumably someone who works in "london" already knows they are in "uk". But if I had a comment on a picture in a post, it's possible that the code indexing the comment doesn't know the post id, and I would have to query some data store to pull out that info anyway.

Using the "routing" approach, you would always put in the "oldest" ancestor if I had multiple levels, correct?

(Colin Goodheart-Smithe) #4

Yes, but also because when you do a has_child search query, where you are looking for parent documents whose children have specific attributes, you need to have both the parent and children in the same place (shard) so you don't have to do lots of network hops between shards gathering and processing the join. Distributed joins are very problematic and do not scale well.

Yes, to do this you would need to perform a query to find the post id and then use it as the routing value.

Yes, the routing value is always the id of the document at the top of the hierarchy.
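
A sketch of that lookup, assuming the application keeps its own child-to-parent mapping (here a plain dict called `parent_of`, which is hypothetical; Elasticsearch does not provide one for routing). `top_ancestor` and the ids are likewise illustrative names.

```python
from typing import Dict

def top_ancestor(doc_id: str, parent_of: Dict[str, str]) -> str:
    """Walk up the child -> parent mapping until a document with no parent is reached."""
    seen = set()
    current = doc_id
    while current in parent_of:
        if current in seen:
            raise ValueError(f"cycle in parent mapping at {current!r}")
        seen.add(current)
        current = parent_of[current]
    return current

# Hypothetical ids for the comment > picture > post example.
parent_of = {"comment-1": "picture-1", "picture-1": "post-1"}
routing = top_ancestor("comment-1", parent_of)
print(routing)  # post-1
```

The result is what you would pass as the routing value, however many levels sit between the document and the top of the hierarchy.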

(Yehosef) #5

Thanks again. I understand why I need the children on the same shard as the parent, what I'm still not sure I understand is why I need to specify the grandparent.

Let's say I have a machine with 5 shards. I index a parent document "ABC" which, because of its id, is hashed to shard 2. I now want to add a child document "KLM". Normally when I add a document I would decide the shard based on the id, so in this case it would have ended up on shard 1. But distributed joins don't work, as you pointed out. So when I'm indexing the child, I need to tell it the parent id so it will hash to the right value and also be on shard 2.

Let's say I now want to query the parent and child (not as parent and child, but independently). When I ask for /parent/ABC it'll hash "ABC" and ask the right shard. What happens when I ask for /child/KLM? How does it know which shard to go to? I assume it's either stored in some index or in memory to make that performant (I really don't understand Elasticsearch internals yet).

Now I want to add a grandchild, XYZ, as a child of KLM. You're saying that I need to say that its parent is KLM and routing=ABC. What I'm not clear about is this: just as ES can figure out which shard KLM is on when I ask for /child/KLM, why can't it use this same knowledge to handle the routing automatically? The grandchild would go to the same shard as KLM, which is already on the same shard as ABC. I don't understand why I need to specify it.

The other thing I don't understand is a quote in the https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html page.

At the time of going to press, the parent-child ID map is held in memory as part of field data. There are plans afoot to change the default setting to use doc values by default instead.

Does this mean all parent child relationships in the entire index are stored, or just the ones that are in active use - so I just need to be careful about my heap space when doing large queries or aggregations using the parent-child relationship?

(Colin Goodheart-Smithe) #6

If you do a GET request for the child document, you need to include the parent query parameter so the request is routed to the correct shard. There is no map of child to parent stored in Elasticsearch for the purposes of routing requests. This is the reason that you need to specify the routing for the grandchild: Elasticsearch doesn't know on which shard the parent or child is stored.
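
As a rough illustration, here is how a client might build those URLs. `doc_url` is a hypothetical helper written for this post, not part of any Elasticsearch client; the index, type and ids are the ones from the earlier curl examples.

```python
from typing import Optional
from urllib.parse import quote, urlencode

def doc_url(base: str, index: str, doc_type: str, doc_id: str,
            parent: Optional[str] = None, routing: Optional[str] = None) -> str:
    """Build a document URL, adding parent/routing query parameters when given."""
    params = {}
    if parent is not None:
        params["parent"] = parent
    if routing is not None:
        params["routing"] = routing
    url = f"{base}/{index}/{doc_type}/{quote(doc_id, safe='')}"
    return f"{url}?{urlencode(params)}" if params else url

base = "http://localhost:9200"
# Child: the parent id is enough, because the parent id is what gets hashed.
print(doc_url(base, "my_index", "my_type", "bar", parent="foo"))
# Grandchild: the top-level id has to be passed explicitly as the routing value.
print(doc_url(base, "my_index", "my_type", "baz", parent="bar", routing="foo"))
```

The client has to carry the parent (or top-level) id along with the document id, since nothing on the server side can supply it.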

This parent-child ID map is a per-shard data structure that stores the parent ID of each child document on that shard (the child document's ID here is an internal Lucene id). It is used at search time to perform the query-time joins.

Yes, all parent-child relationships in the entire index are stored in this memory structure so you will need to be careful about heap space but it is loaded once and cached so its heap usage shouldn't change over time (given a static index, if you index more documents it will grow).

(Yehosef) #7

Ok, I understand better now - thanks.

Do you have recommendations on how to estimate the memory usage for this case? We are working on aggregating user activity by day/week/month. Originally we were going to store the aggregations as nested documents, but I thought that storing them as child documents might be more flexible. I could query them more or less as in the nested-documents use case, but I could also do aggregations on the daily aggregations independently.

For example, I could show the behaviour trends of people in their first two weeks after signup. We would have around 50M users with varying numbers of daily/weekly/monthly aggregations. Let's say on average we have 20 child documents per member. I would probably partition the users by signup date, but I'm not sure. I don't understand how the heap space will affect us (I've avoided the heap space issue for the most part by using doc_values for the raw data - we currently have about 700M rows / 315GB of data on a 3-machine cluster with 36GB RAM total).

(Martijn Van Groningen) #8

If you choose to model your data with parent/child, then the parent/child memory structure is going to take a significant portion of your heap space. Fortunately, from 2.0 the parent/child memory structure will use doc values, and then you shouldn't need to worry about heap memory for parent/child any more.

On a 1.x release the following factors determine the amount of memory the parent/child data structure is going to take on the heap space for a particular shard:

  1. The number of unique parent ids. The parent ids are stored as utf8 strings.
  2. The number of parent and child documents. Each document has an entry point in this p/c data structure. These entry points are compressed.
  3. The number of segments. The parent/child data structure is per segment. If a parent and all its children appear in a single segment then the parent id those documents are sharing only appears in that segment. But if the parent and its child documents are scattered across segments then the parent id those docs share is being duplicated between segments. This usually is the case for newly introduced documents and over time as segments get merged parent and child documents are more likely to end up in the same segment and the duplication factor is then going to drop.

The number of unique parent ids is the most dominant factor. The other two factors are less dominant and also harder to estimate, because the entry points for the documents get compressed and the number of segments varies over time.

So what you can do to very roughly estimate the used heap memory for parent/child:
num_parent_ids * longest_id_length

For a shard that gets heavily indexed into, the number of segments should be somewhere between 20 and 100. Not all parent ids are going to be duplicated in all segments, but certainly in a number of segments; let's assume 10. So you should then multiply the number that resulted from the previous estimation by 10.
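
Put together, the rough estimate looks like this. The function name is made up, and the 20-byte id length is purely an assumption for illustration. The 50M parents figure comes from the question above, and the duplication factor of 10 is the guess from the previous paragraph.

```python
def estimate_parent_child_heap_bytes(num_parent_ids: int,
                                     longest_id_length: int,
                                     segment_duplication: int = 10) -> int:
    """Very rough estimate: parent ids * id length * segment duplication factor."""
    return num_parent_ids * longest_id_length * segment_duplication

# 50M parents (from the question), 20-byte ids (assumed), duplicated in ~10 segments.
total = estimate_parent_child_heap_bytes(50_000_000, 20)
print(f"{total / 1024**3:.1f} GiB summed over all shards")  # 9.3 GiB
```

Since the parent count used here is for the whole index, the result is a total across shards rather than a per-shard figure, and it ignores the per-document entry points.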

Personally I find it tricky to estimate this, so I usually experiment with a subset of the data (which must be representative of the entire data set) and make my estimates based on those findings.

Looking at your last message, I think you should really evaluate whether parent/child is necessary in your situation. It looks like you're dealing with time-based data, and in that case parent/child isn't the best way of modelling your data, since after some time you stop writing into the indices. So the flexibility parent/child offers isn't needed, and denormalization is then a valid option too. Beyond additional heap memory, parent/child also has a big impact on query performance (to perform the actual join). This has also improved significantly in the upcoming 2.0 release, but it still isn't cheap.
