I have about 100 million "parent" documents, each with a field containing an array of roughly 20 "random" products. Each parent also has fields that sum and average values across its 20 products, and these aggregated fields are the ones that get searched.
There are currently hundreds of millions of product docs, and each product can appear in tens of thousands of parent documents.
So whenever a value changes on a product that feeds into a parent's calculated values, I'd like all of the aggregated values on those parents to update.
Right now I have an off-server script that walks through each parent document, grabs the current values for each of its products, aggregates them, and updates the parent doc's fields. This approach makes anything near realtime accuracy for searches on the aggregated parent fields impossible; in fact it can take days to update all 100 million. Near realtime isn't strictly necessary, but if I can get there, it would be preferable.
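For concreteness, the re-aggregation step the script performs per parent looks something like this (a minimal sketch; the field names `sum_value` and `avg_value` are made up for illustration):

```python
def reaggregate(product_values):
    """Recompute the searched aggregate fields for one parent doc
    from the current values of its ~20 products."""
    total = sum(product_values)
    return {
        "sum_value": total,
        "avg_value": total / len(product_values) if product_values else 0.0,
    }
```

The cost isn't this computation; it's fetching the current product values and re-indexing 100 million parents one at a time.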
So what I want to do is run an update by query to find all the parent docs that contain the product, and have each matching parent run a script that re-aggregates the values and updates the corresponding fields.
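Something like the following request body per changed product. This is a sketch under assumptions, not a tested query: it assumes the parents embed their products as `[{"id": ..., "value": ...}, ...]` and store `sum_value`/`avg_value` (all field names hypothetical), so the Painless script can replace the stale embedded value and recompute the aggregates in place:

```python
def build_ubq_body(product_id, new_value):
    """Request body for POST parents/_update_by_query after one product changes."""
    return {
        "query": {"term": {"products.id": product_id}},
        "script": {
            "lang": "painless",
            "params": {"pid": product_id, "val": new_value},
            # Overwrite the stale embedded copy of the product's value,
            # then recompute sum and average over the ~20 products.
            "source": (
                "double total = 0;"
                "for (p in ctx._source.products) {"
                "  if (p.id == params.pid) { p.value = params.val; }"
                "  total += p.value;"
                "}"
                "ctx._source.sum_value = total;"
                "ctx._source.avg_value = total / ctx._source.products.length;"
            ),
        },
    }
```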
The issue is that some 20,000 products per second get updated, and issuing 20,000 update by queries per second could be a huge performance hit, since each one fans out to many times that number of parent docs needing updates per second.
I haven't tried the update by query method yet; I'm in the middle of building a new cluster that will run ES 6.x, with around 10 hot data nodes on high-end M.2 SSDs to start.
Will the update by query method be feasible? It's possible I can cut the 20,000 UBQs per second to less than half, because not every product update requires a UBQ: the aggregated values don't always change.
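The filtering I have in mind could look roughly like this: within a short window, keep only the latest value per product and drop updates that leave the indexed value unchanged, so each surviving entry costs exactly one UBQ (the `current_values` lookup is hypothetical, standing in for whatever tracks the last-indexed value):

```python
from collections import OrderedDict

def filter_and_coalesce(updates, current_values):
    """Collapse a burst of (product_id, new_value) updates: later updates
    for the same product overwrite earlier ones, and no-op updates whose
    value matches what is already indexed are dropped."""
    latest = OrderedDict()
    for product_id, value in updates:
        latest[product_id] = value  # later updates win
    return [(pid, v) for pid, v in latest.items()
            if current_values.get(pid) != v]
```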
What's the best, most performant approach? Is there a different way entirely, or possibly a hybrid of the two?