Multi phase deduplication on query-time

Alright so I think I have a somewhat interesting query problem to solve. Any help is appreciated.

I need to do a multi phase deduplication on query time. To elaborate more, my data is shared between multiple tenants and for each tenant I have a ranking config that decides what the top hit for field cv_number should be by origin_id . Meaning if I have a duplicate cv_number I need to pick whichever origin_id has precedence for that tenant. And top of these deduplicated docs, I need to run another deduplication by field source_key .This query currently runs on an SQL db. We use with clauses and basically group by - distinct alias that as phase_1 and group by - distinct again and paginate the result.

Originally, my plan was;

  1. use scripted sort
  2. collapse on field cv_number
  3. run terms query on top collapsed results with a top hits.

Now, I realize that I cannot run aggregations on collapsed result so my plan fell apart. Apart from that It might not be possible to use a scripted sort because then origin_id s with lower ranks would be sent to latest pages. We want ranks to effect only duplicated docs.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.