Data modelling to avoid reprocessing

Hi team! I'm not experienced with Elasticsearch.
I'm looking for a new technical solution for dashboard generation with filter and aggregation queries, so I decided to look at Elasticsearch.
However, my data is composed of two sides, and both share a unique id.
The first side comprises an id (millions to billions of ids) and multiple other fields, and the values for all ids are updated in batches by Spark jobs (hypothetically, more than 100 batches per day).
The second side is organized by id plus some other static fields (usually thousands to millions of ids per set, and close to a thousand sets).
The first side contains every existing id, whereas the second side is made up of multiple sets of ids.
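For concreteness, here is a minimal sketch of how I picture each side as documents (all names and values are hypothetical):

```python
# First side: one document per id; the value fields are rewritten in batches
# by the Spark jobs (hypothetically 100+ batches per day).
main_doc = {
    "id": "u-123",
    "score": 0.87,    # hypothetical frequently-updated field
    "category": "A",  # hypothetical dashboard dimension
}

# Second side: one document per set, holding the static fields plus member ids.
set_doc = {
    "set_id": "set-42",
    "label": "campaign-x",               # hypothetical static field
    "ids": ["u-123", "u-456", "u-789"],  # thousands to millions of ids per set
}
```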

So, in the SQL universe, a join by id between the main table (first side) and each set of ids (second side) creates a view. I would then have multiple views that always reflect the latest first-side values whenever I query them.

Therefore, I thought of two solutions.
I1. Create a single index from the first side, and at query time combine the usual dashboard clauses (filters, terms aggregations for "group by count", etc.) with an ids filter built from the second side.
The problem is that the second side's static fields wouldn't be available in the index, and large ids filters are not great performance helpers [although with larger clusters it gets better, right?].
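Something like this sketch is what I have in mind, assuming the 8.x Python client and the hypothetical names from above:

```python
from elasticsearch import Elasticsearch  # assumes the 8.x Python client

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Dashboard query: aggregate over the main index, restricted to one set's ids.
resp = es.search(
    index="main-side",                                     # hypothetical index name
    size=0,                                                # aggregations only, no hits
    query={"terms": {"id": ["u-123", "u-456", "u-789"]}},  # ids from one second-side set
    aggs={"per_category": {"terms": {"field": "category"}}},
)
print(resp["aggregations"]["per_category"]["buckets"])
```

As I understand it, for very large sets a terms lookup (storing each set as a document and pointing the terms query at its ids field) would avoid shipping all the ids with every request, though it has size limits of its own.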

I2. Create multiple indexes with the data from both sides pre-joined.
All queries would work perfectly in every scenario.
However, I would need to update all of those indexes whenever a batch is processed.
I looked at the update_by_query API, which would solve it, but how would these multiple updates perform? Hours? And these updates are not resilient, so a fallback would be necessary, right?
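For reference, this is roughly how I pictured pushing one batch into one of the pre-joined indexes (a sketch with the same hypothetical names; it would have to run once per index):

```python
from elasticsearch import Elasticsearch  # assumes the 8.x Python client

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# One Spark batch: id -> new value (hypothetical).
updated = {"u-123": 0.91, "u-456": 0.12}

resp = es.update_by_query(
    index="joined-set-42",                   # one of the pre-joined indexes
    query={"terms": {"id": list(updated)}},  # touch only the ids in this batch
    script={
        "source": "ctx._source.score = params.scores[ctx._source.id]",
        "params": {"scores": updated},
    },
    conflicts="proceed",        # keep going on version conflicts instead of aborting
    wait_for_completion=False,  # run as a background task instead of blocking
)
print(resp["task"])  # poll this via the Tasks API to track progress and failures
```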

Thanks in advance (please let me know if something isn't clear).

Hi @lunea. It can be hard to transition from the SQL world to a NoSQL world like Elasticsearch. Instead of figuring out the most normalized way to model your data, ask yourself how your application is going to access the data, and work backward from that to the data model. Joins are generally expensive, so for the best application performance, don't be afraid to denormalize your data in Elasticsearch a lot more than you would in a SQL database. Having said that, updates are obviously not free, so if you find yourself having to update data in millions of places after one change, you've probably denormalized too far.
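As an illustration (borrowing the hypothetical field names from your sketch, not prescribing a schema), a denormalized document could fold a set's static fields into each main-side document:

```python
# With this shape, every dashboard query is a plain filter + aggregation
# on a single index, with no join or ids filter at query time.
denormalized_doc = {
    "id": "u-123",
    "score": 0.87,          # hot field, rewritten by the Spark batches
    "category": "A",
    "set_id": "set-42",     # membership copied from the second side
    "label": "campaign-x",  # static field copied from the second side
}
```

If an id belongs to many sets, you would either duplicate the document per set or index the memberships as an array, and the number of documents one batch has to rewrite is exactly where the "denormalized too far" trade-off shows up.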
