Multi phase deduplication on query-time

osmanmesutozcan · June 3, 2021, 9:40am

Alright so I think I have a somewhat interesting query problem to solve. Any help is appreciated.

I need to do a multi phase deduplication on query time. To elaborate more, my data is shared between multiple tenants and for each tenant I have a ranking config that decides what the top hit for field cv_number should be by origin_id . Meaning if I have a duplicate cv_number I need to pick whichever origin_id has precedence for that tenant. And top of these deduplicated docs, I need to run another deduplication by field source_key .This query currently runs on an SQL db. We use with clauses and basically group by - distinct alias that as phase_1 and group by - distinct again and paginate the result.

Originally, my plan was;

use scripted sort
collapse on field cv_number
run terms query on top collapsed results with a top hits.

Now, I realize that I cannot run aggregations on collapsed result so my plan fell apart. Apart from that It might not be possible to use a scripted sort because then origin_id s with lower ranks would be sent to latest pages. We want ranks to effect only duplicated docs.

system · July 1, 2021, 9:41am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Sort with dedupe? Elasticsearch	1	463	July 5, 2017
Deduplication filter? Elasticsearch	4	4830	July 6, 2017
Getting correct counts after aggregation? Elasticsearch	1	737	July 5, 2017
Using aggregation in Elasticsearch to find duplicate data in the same index Elasticsearch painless	6	15717	January 22, 2021
7.x How do remove duplicates in search after query? Elasticsearch	2	543	February 2, 2021

Multi phase deduplication on query-time

Related topics