Why do grandchildren need to specify the grandparents with the routing option?

If you choose to model your data with parent/child, then the parent/child data structure is going to take a significant portion of your heap space. Fortunately, from 2.0 onwards the parent/child data structure will be backed by doc values, so you shouldn't have to worry about heap memory for parent/child any more.

On a 1.x release, the following factors determine the amount of heap memory the parent/child data structure takes for a particular shard:

  1. The number of unique parent ids. The parent ids are stored as UTF-8 strings.
  2. The number of parent and child documents. Each document has an entry point in the parent/child data structure. These entry points are compressed.
  3. The number of segments. The parent/child data structure is per segment. If a parent and all its children appear in a single segment, then the parent id those documents share only appears in that segment. But if the parent and its child documents are scattered across segments, then the parent id they share is duplicated between segments. This is usually the case for newly indexed documents; over time, as segments get merged, parent and child documents are more likely to end up in the same segment and the duplication factor drops.

The number of unique parent ids is the most dominant factor. The other two factors are less dominant and also harder to estimate, because the entry points for the documents get compressed and the number of segments varies over time.

So, to very roughly estimate the heap memory used for parent/child:
num_parent_ids * longest_id_length

For a shard that gets heavily indexed into, the number of segments should be somewhere between 20 and 100. Not all parent ids are going to be duplicated in all segments, but certainly in a number of them; let's assume 10. You should then multiply the number that resulted from the previous estimation by 10.
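Putting the two steps above together, here is a back-of-the-envelope sketch. The input numbers are made up for illustration; only the formula and the assumed duplication factor of 10 come from the reasoning above:

```python
# Very rough heap estimate for the 1.x parent/child data structure on one shard.
# All input values below are hypothetical examples.
num_parent_ids = 5_000_000     # unique parent ids in the shard
longest_id_length = 20         # bytes; parent ids are stored as UTF-8 strings
duplication_factor = 10        # assumed number of segments sharing each parent id

estimate_bytes = num_parent_ids * longest_id_length * duplication_factor
print(f"~{estimate_bytes / 1024**3:.1f} GiB")  # ~0.9 GiB
```

Again, this is only an order-of-magnitude figure; the entry-point compression and the real segment count can push it in either direction.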

Personally, I find it tricky to estimate this, so I usually experiment with a subset of the data (which must be representative of the entire data set) and base my estimates on those findings.

Looking at your last message, I think you should really evaluate whether parent/child is necessary in your situation. It looks like you're dealing with time-based data, and in that case parent/child isn't the best way of modelling your data, since after some time you stop writing into old indices. The flexibility parent/child offers isn't needed then, and denormalization is a valid option too. Beyond additional heap memory, parent/child also has a big impact on query performance (to perform the actual join). This has significantly improved as well in the upcoming 2.0 release, but still isn't cheap.
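To make the denormalization idea concrete, here is a hypothetical sketch (all field names and values are invented for illustration): instead of indexing parent and child documents separately and joining them at query time, the parent's fields are copied into each child document.

```python
# Parent/child modelling: two document types, joined at query time.
parent_doc = {"_id": "sensor-1", "location": "amsterdam"}
child_doc = {"_parent": "sensor-1", "value": 21.5, "timestamp": "2015-06-01T12:00:00"}

# Denormalized modelling: the parent's fields are duplicated into each
# measurement, so no join (and no parent/child heap structure) is needed.
denormalized_doc = {
    "sensor_id": "sensor-1",
    "location": "amsterdam",   # copied from the parent
    "value": 21.5,
    "timestamp": "2015-06-01T12:00:00",
}
```

The trade-off is index size (the parent fields are stored once per child), but for time-based indices that are never updated after a while, that is usually a better deal than the join cost.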
