Alright so I think I have a somewhat interesting query problem to solve. Any help is appreciated.
I need to do a multi phase deduplication on query time. To elaborate more, my data is shared between multiple tenants and for each tenant I have a ranking config that decides what the top hit for field cv_number should be by origin_id . Meaning if I have a duplicate cv_number I need to pick whichever origin_id has precedence for that tenant. And top of these deduplicated docs, I need to run another deduplication by field source_key .This query currently runs on an SQL db. We use with clauses and basically group by - distinct alias that as phase_1 and group by - distinct again and paginate the result.
Originally, my plan was;
- use scripted sort
- collapse on field
cv_number - run terms query on top collapsed results with a top hits.
Now, I realize that I cannot run aggregations on collapsed result so my plan fell apart. Apart from that It might not be possible to use a scripted sort because then origin_id s with lower ranks would be sent to latest pages. We want ranks to effect only duplicated docs.