Alright so I think I have a somewhat interesting query problem to solve. Any help is appreciated.
I need to do a multi phase deduplication on query time. To elaborate more, my data is shared between multiple tenants and for each tenant I have a ranking config that decides what the top hit for field cv_number
should be by origin_id
. Meaning if I have a duplicate cv_number
I need to pick whichever origin_id
has precedence for that tenant. And top of these deduplicated docs, I need to run another deduplication by field source_key
.This query currently runs on an SQL db. We use with
clauses and basically group by - distinct
alias that as phase_1
and group by - distinct
again and paginate the result.
Originally, my plan was;
- use scripted sort
- collapse on field
cv_number
- run terms query on top collapsed results with a top hits.
Now, I realize that I cannot run aggregations on collapsed result so my plan fell apart. Apart from that It might not be possible to use a scripted sort because then origin_id
s with lower ranks would be sent to latest pages. We want ranks to effect only duplicated docs.