If I have many posts that belong to a particular user, is it better to do a parent/child relationship or denormalize the data by putting the user in every post document.
In my use-case, the user is regularly being updated, and may have thousands of posts.
I would need to update_by_query all posts that have that user in order to keep the user data accurate.
I would potentially need to update a million or more documents whenever we process new posts (thousands of users with thousands of posts).
With my situation, would it be better to use the join datatype and take a hit on query performance?
If you are updating the users infrequently, updating a million documents per day will not necessarily take a lot of time or resources from the cluster. Doing this denormalization typically leads to simpler queries and better performance. If you are updating users more frequently the balance may change. I do not think anyone but you can determine what the balance is for your particular use case.
The updates happen by a scheduled job, so I could be updating a million users every 10min or so.
parent/child is preferred since I would be able to query for a single parent that matches many child conditions. For example, must [post.type = "tech", post.type = "finance"], to find the profile with both. In the denormalized example this won't match. We have a UI in front of a query builder that enforces some of these scenarios, and searching becomes difficult with denormalized documents.
If you have experience with parent/child joins, how much of a hit performance-wise should I expect?
I can see that the documentation warns against it, but I feel like my use case calls for it.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.