Denormalized Data vs. Joins

#1

If I have many posts that belong to a particular user, is it better to do a parent/child relationship or denormalize the data by putting the user in every post document.

In my use-case, the user is regularly being updated, and may have thousands of posts.

I would need to update_by_query all posts that have that user in order to keep the user data accurate.

I would potentially need to update a million or more documents whenever we process new posts (thousands of users with thousands of posts).

With my situation, would it be better to use the join datatype and take a hit on query performance?

Thanks for your help.

(Christian Dahlqvist) #2

If you are updating the users infrequently, updating a million documents per day will not necessarily take a lot of time or resources from the cluster. Doing this denormalization typically leads to simpler queries and better performance. If you are updating users more frequently the balance may change. I do not think anyone but you can determine what the balance is for your particular use case.

#3

The updates happen by a scheduled job, so I could be updating a million users every 10min or so.

parent/child is preferred since I would be able to query for a single parent that matches many child conditions. For example, must [post.type = "tech", post.type = "finance"], to find the profile with both. In the denormalized example this won't match. We have a UI in front of a query builder that enforces some of these scenarios, and searching becomes difficult with denormalized documents.

If you have experience with parent/child joins, how much of a hit performance-wise should I expect?

I can see that the documentation warns against it, but I feel like my use case calls for it.

Thanks.