I have an index that makes extensive use of Parent/Child relations, and over time some Child docs have been deleted without the references to them being removed from their Parent. This doesn't affect searches, but I do use a count of Children in ordering, so I need to prune these 'zombie' references to non-existent Children from the Parents.
I can imagine a brute-force approach: retrieve each Parent that has Children, then fetch all Child docs and do a pythonic set comparison of ids. Is there a more efficient approach?
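For reference, the set comparison described above might look roughly like the sketch below. It assumes the parent documents and the ids of all existing children have already been pulled out of the index (for example with a scan) into plain Python structures; the function names `find_zombie_refs` and `prune_children` are made up for illustration, and only the `children` field name comes from the thread.

```python
def find_zombie_refs(parents, existing_child_ids):
    """Map each parent _id to the child _ids it lists that no longer exist.

    parents: dict of parent _id -> parent _source (a dict with a 'children' list).
    existing_child_ids: iterable of the _ids of all child docs still in the index.
    Both are assumed to have been fetched beforehand.
    """
    existing = set(existing_child_ids)
    zombies = {}
    for parent_id, source in parents.items():
        # Set difference: ids the parent lists minus ids that actually exist.
        missing = set(source.get("children", [])) - existing
        if missing:
            zombies[parent_id] = missing
    return zombies


def prune_children(source, missing):
    """Return the parent's children list without the missing ids, order preserved."""
    return [cid for cid in source.get("children", []) if cid not in missing]


# Small in-memory example with hypothetical ids.
parents = {
    "p1": {"children": ["c1", "c2", "c3"]},
    "p2": {"children": ["c4"]},
}
existing_child_ids = ["c1", "c3", "c4"]

zombies = find_zombie_refs(parents, existing_child_ids)
cleaned = prune_children(parents["p1"], zombies["p1"])
```

Collecting the existing child ids once and comparing in memory avoids one lookup per parent, but it still touches every parent and child, so it is the brute-force approach rather than an alternative to it.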
Typically, parent-child relationships do not store references to the children in the parent, as far as I know, so is this something you have added and maintain yourself?
Yes, I have added a children array field, and each time a child is assigned to an existing parent, its _id gets added to the parent's children[]. There's also a searchy[] array, and certain tag terms of the child get added to that.
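As a rough illustration of that bookkeeping, here is a sketch that works on a parent's _source as a plain dict. The helper name `add_child` and the sample ids and tags are hypothetical; only the `children` and `searchy` field names are taken from the description above, and writing the updated document back to the index is left out.

```python
def add_child(parent_source, child_id, child_tags):
    """Record a child on its parent: append its _id and merge its tag terms.

    parent_source is modified in place and also returned. Duplicates are
    skipped so that re-assigning the same child does not inflate the count.
    """
    children = parent_source.setdefault("children", [])
    if child_id not in children:
        children.append(child_id)

    searchy = parent_source.setdefault("searchy", [])
    for tag in child_tags:
        if tag not in searchy:
            searchy.append(tag)
    return parent_source


# Hypothetical parent document and child.
parent = {"title": "example parent", "children": ["c1"], "searchy": ["red"]}
add_child(parent, "c2", ["red", "blue"])
add_child(parent, "c2", ["blue"])  # second call changes nothing
```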
This seemed convoluted at the time (and still does), but it was the only way I could think of to meet my requirement: searching for a tag and returning one or more parent/child 'clusters'. That is, all of the parent+children sets that have a given tag in any of their members, whether parent or child. So the search for tags is limited to parents, and returns the parent and its child _ids. It approximates a graph, in a way.
Maintenance is proving to be a struggle, because over time documents can get updated or removed by a web app, which means surgery: removing a parent that has children requires transferring the parent role to one of them, and removing a child requires removing its _id from the parent's children[] field... not to mention maintenance of the searchy[] field.
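The child-removal half of that surgery could be sketched as below. This assumes searchy[] holds only terms derived from the children, so it can be rebuilt from the tags of the children that remain; if the parent's own tags also live there, they would need to be merged back in. The name `remove_child` and the `tags_by_child` mapping are hypothetical, and persisting the result is not shown.

```python
def remove_child(parent_source, child_id, tags_by_child):
    """Return a copy of the parent with one child removed.

    tags_by_child: dict of child _id -> list of tag terms, assumed to have
    been looked up beforehand. searchy is rebuilt from the remaining children
    instead of subtracting the removed child's tags, because another child
    may carry the same tag.
    """
    children = [cid for cid in parent_source.get("children", []) if cid != child_id]

    searchy = []
    for cid in children:
        for tag in tags_by_child.get(cid, []):
            if tag not in searchy:
                searchy.append(tag)

    updated = dict(parent_source)
    updated["children"] = children
    updated["searchy"] = searchy
    return updated


# Hypothetical data: c1 and c2 share the tag "red".
parent_doc = {"children": ["c1", "c2"], "searchy": ["red", "blue"]}
child_tags = {"c1": ["red"], "c2": ["red", "blue"]}
after_removal = remove_child(parent_doc, "c2", child_tags)
```

Rebuilding rather than subtracting keeps "red" in this example, since c1 still carries it even though c2 is gone.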
Seems I've dug myself a hole - suggestions are welcome!