The GitHub issues mention that this is the place to come for answers, so I'll ask again!
I have a valid scenario where I have a parent/child relationship (hence the desire for a join field) but some documents that would otherwise be children do not have a parent value. Business is the parent doc. Event is the child doc, but it may NOT be associated with a Business (it could be a community-run event).
When indexing data, I cannot set the parent value of a child's join field data to null:
[parent] is missing for join field [relation]
To get this to work, I can set a dummy value like "noparent-{}".format(child_obj.id), and of course I must use that same value for _routing. (Is this the recommended, or the only, approach here?) A child object with a parent would just use child_obj.business_id as both the parent and the routing value.
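A minimal sketch of that scheme, to make the question concrete. The Event class, the function names, and the relation name "event" are made up for illustration; only child_obj.id, business_id, and the join field name relation come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical stand-in for child_obj."""
    id: int
    business_id: Optional[int] = None  # None means no owning Business


def parent_and_routing(event: Event) -> str:
    """Return the single value used as join-field parent AND as routing."""
    if event.business_id is not None:
        return str(event.business_id)
    # Placeholder parent: unique per event, refers to no real document.
    return "noparent-{}".format(event.id)


def event_document(event: Event) -> dict:
    """Build the join-field part of the document body for a child Event."""
    return {
        "relation": {"name": "event", "parent": parent_and_routing(event)},
    }


owned = Event(id=7, business_id=42)
orphan = Event(id=8)
print(parent_and_routing(owned), event_document(owned))
print(parent_and_routing(orphan), event_document(orphan))
```

The point of the sketch is only that the parent value and the routing value are always the same string, whether or not a real Business exists.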
Coming from a user's perspective, it doesn't make sense to discount the possibility of parentless child documents with regards to join fields. Almost all of the GH issue commentary I've looked at in regards to this scenario seems to be very ES-product focused, rather than user/customer use case focused.
I don't think the argument of "if you later set the parent for a parentless Event, then you need to be sure you delete it from the appropriate shard" (seen in GH commentary) really makes that much sense either, as the same action must be taken if you change the parent for an Event that already has a parent (due to shard routing differences).
Would love to get the expert, canonical advice for how to deal with this scenario. (Oh, and I'd also love to see support for parentless children in ES!)
Ok, let's try to make this a little more concrete with an example: We are creating one company and one event organized by that company plus another parentless event. Finally, we want to check on which shards they ended up.
By pure coincidence, they all end up on the same shard, [my_index][3]. However, if you add the following document, you can see that it ends up on shard 0 and everything works as expected.
Setting _routing is the only way to work with the children of a parent-child relationship: each child has to be routed to the shard that holds its parent.
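To show what routing means for the scenario in this thread (one company, one event organized by it, one parentless event), here is a rough sketch that only builds and prints the REST requests; it does not talk to a cluster. The document IDs, the name field, the relation names "company" and "event", and the typeless 7.x-style mapping are assumptions for illustration, not the exact requests used in the example above.

```python
import json

INDEX = "my_index"

# Assumed mapping: a single join field called "relation",
# with "company" as the parent and "event" as the child.
create_index = ("PUT", "/" + INDEX, {
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"company": "event"}},
        }
    }
})


def index_request(doc_id, body, routing=None):
    """Describe an index request as a (method, path, body) tuple."""
    path = "/{}/_doc/{}".format(INDEX, doc_id)
    if routing is not None:
        path += "?routing={}".format(routing)
    return ("PUT", path, body)


requests = [
    create_index,
    # Parent document: no explicit routing, so it is routed by its own ID.
    index_request("1", {"name": "Some company",
                        "relation": {"name": "company"}}),
    # Child with a real parent: routing equals the parent's ID.
    index_request("2", {"name": "Company event",
                        "relation": {"name": "event", "parent": "1"}},
                  routing="1"),
    # Parentless child: dummy parent, and the same dummy value as routing.
    index_request("3", {"name": "Community event",
                        "relation": {"name": "event", "parent": "noparent-3"}},
                  routing="noparent-3"),
]

for method, path, body in requests:
    print(method, path)
    print(json.dumps(body, indent=2))
```

One way to see where a routing value lands is GET /my_index/_search_shards?routing=<value>, which lists the shard that would be searched for that value.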
While you cannot skip the parent, a dummy parent ID will work just fine (the referenced parent document doesn't need to exist). That said, having an actual parent for orphans might make your code simpler, so you could just create an "orphan" company for that.
The main problem with a single dummy value or an orphan company is that your children might be unbalanced; if 50% of your events don't have an actual company, then those 50% would all go to the same shard. This might be an issue with huge datasets, but for many datasets you'll be fine with a single shard and there is no balancing to consider. For big datasets you should pick a different routing key for the orphans. It shouldn't really matter which one (since these documents don't need to live on a specific shard), as long as it's more or less evenly distributed.
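A toy illustration of the balancing point, under loud assumptions: the hash below (CRC32 modulo the shard count) is only a stand-in and is NOT the function Elasticsearch uses for routing, and the shard count of 5 is made up. It only shows the general effect that one shared routing value maps to one shard, while per-document values spread out.

```python
import zlib
from collections import Counter

NUM_SHARDS = 5  # assumed number of primary shards


def toy_shard(routing: str) -> int:
    """Stand-in for the routing hash; not Elasticsearch's real function."""
    return zlib.crc32(routing.encode("utf-8")) % NUM_SHARDS


orphan_ids = range(1000)

# Every orphan shares one dummy parent / routing value.
shared = Counter(toy_shard("noparent") for _ in orphan_ids)

# Every orphan gets its own dummy value, e.g. "noparent-17".
per_child = Counter(toy_shard("noparent-{}".format(i)) for i in orphan_ids)

print("shared dummy value:   ", dict(shared))
print("per-child dummy value:", dict(per_child))
```

With the shared value, all 1000 orphans are counted against a single shard; with per-child values, the counts are spread over several shards.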
So while we don't explicitly support parentless children, they should work fine for your use case. Though if you have many parentless children, then parent-child might just be the wrong data structure for your scenario.
For switching parent or deleting children: Building distributed systems is a complicated topic. You are running into two tradeoffs of Elasticsearch here:
No multi-document transactions.
Children need to live with their parents (on the same shard).
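Taken together, these two points mean that moving an event to a different parent (or giving a parentless event a parent) is two separate requests rather than one atomic update. A sketch of what that could look like, again only building the requests; the function name and the relation name "event" are hypothetical, and the field name relation is taken from the error message in the question.

```python
def reparent_requests(index, event_id, old_routing, new_routing, source):
    """Return the ordered requests needed to move a child to a new parent.

    The delete comes first: if both routing values happen to map to the
    same shard, indexing first and deleting afterwards would remove the
    freshly written document.
    """
    body = dict(source)  # copy so the caller's dict is not modified
    body["relation"] = {"name": "event", "parent": new_routing}
    base = "/{}/_doc/{}".format(index, event_id)
    return [
        ("DELETE", "{}?routing={}".format(base, old_routing), None),
        ("PUT", "{}?routing={}".format(base, new_routing), body),
    ]


ops = reparent_requests(
    "my_index", "3",
    old_routing="noparent-3",  # previously parentless
    new_routing="1",           # now belongs to company 1
    source={"name": "Community event"},
)
for method, path, body in ops:
    print(method, path, body)
```

Because there is no transaction around the two requests, the event is briefly missing between them, and a failure after the delete has to be handled by the application.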
We are generally trying not to hide limitations behind syntactic sugar, since this will bite you at some point (normally when something fails). And speaking of syntactic sugar, we are trying to keep our APIs small — otherwise maintenance would get out of hand. I'm not aware of any development for additional helper APIs around parent-child at the moment, but that doesn't mean it won't ever happen.
Thanks for the reply. I'm going with a business-<business-id> parent document ID where a parent exists, and noparent-<child-id> for parentless events (which avoids overloading a single shard). I do need to handle routing differently for each case, but that is manageable.
It feels kind of hackish to do it this way, but I understand the rationale behind all of this much better now. Thanks again.