Hi team! I'm not experienced in Elasticsearch.
I'm looking for a new technical solution to dashboard generation with filters and aggregations queries, so I decided to look at Elasticsearch.
However, my data is composed of two sides, and both have a unique id.
The first side comprises an id(millions/billions of ids) and multiple other fields. And all ids from each field are updated in batches(spark jobs) (hypothetically, more than 100 batches per day);
The second side is organized by the id and some other static fields (usually, thousands or millions ids | near to a thousand sets);
The first one contemplates all existent ids, whereas the second is multiple sets of ids.
So, in the SQL universe, a join by id from the main table (first side) with each set of ids (second) creates a view. Then, I would have multiple views with constantly updated values from the first side whenever I query them.
Therefore, I thought of two solutions.
I1. Create an index with the first side and query [(filters, terms ("group by count"), etc.) + filter ids] from the second side.
I would have some problems with second-side static fields. And ids filters are not great performance helpers [although with larger clusters, it gets better, right?].
I2. Create multiple indexes with previously joined data from both sides.
All queries would work perfectly for every scenario.
However, I would need to update all indexes when processing the batch.
I took at update_by_query API and would solve, but how would these multiple updates perform? Hours? And these updates are not resilient. So a fallback would be necessary, right?
Thanks in advance (please let me know if something isn't clear).