Near duplicate document detection

tallison · July 15, 2020, 4:36pm

I have an index with ~12 million files. We have a serious problem with both exact duplicates and near duplicates.

The exact duplicate problem is trivial because we store a digest. The near duplicate problem is harder, and I've read several of the posts on discuss.elastic.co that deal with near duplicates.

I reindexed with a min_hash filter, using the default 512 buckets and the other defaults, and I stored term_vectors for that field on the theory that this would improve search speed. Reindexing time and the extra space required were both impressive; no problems there.
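For readers who want to reproduce a setup like this, here is a minimal sketch of index settings along those lines, written as a Python dict that could be sent as the body of an index-creation request. It is not the exact configuration used above: the analyzer, filter, and field names are made up for illustration, and the 5-word shingle filter in front of min_hash is an assumption borrowed from the usual way the filter is paired with shingles. The min_hash values shown are the documented defaults.

```python
# Hypothetical index body. The names "content_shingles", "content_minhash",
# "minhash_analyzer", and the "content.minhash" sub-field are illustrative only.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Assumption: hash 5-word shingles rather than single tokens.
                "content_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                # min_hash token filter with its default values spelled out.
                "content_minhash": {
                    "type": "min_hash",
                    "hash_count": 1,
                    "bucket_count": 512,
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "minhash_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "content_shingles", "content_minhash"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "fields": {
                    # Store term vectors so the signature can be read back per document.
                    "minhash": {
                        "type": "text",
                        "analyzer": "minhash_analyzer",
                        "term_vector": "yes",
                    }
                },
            }
        }
    },
}
```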

When I run a MoreLikeThisQuery, the performance is not great (from 200 ms up to 45 seconds per query), even if I limit the terms to the top 5. I saw similar performance with MLT on the plain content field (with no term vectors).

I got much better performance when I programmatically built my own Boolean AND query from a subset of the terms stored in the term vector of the min hash field, but then I had to compute Jaccard similarity on the matches, which can be expensive given the number of near duplicates we have.
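To make that two-step approach concrete, here is a rough sketch, not my production code. It assumes the min hash tokens for a document have already been read out of its term vector and can be treated as opaque strings; the field name `content.minhash`, the sample size, and both function names are hypothetical.

```python
import random


def build_candidate_query(minhash_terms, field="content.minhash", sample_size=8, seed=0):
    """Build a Boolean AND (bool/filter) over a random subset of min hash terms."""
    rng = random.Random(seed)
    terms = sorted(set(minhash_terms))  # sort so a given seed picks the same subset
    sample = rng.sample(terms, min(sample_size, len(terms)))
    # Every sampled term must match, so this only retrieves candidates.
    return {"query": {"bool": {"filter": [{"term": {field: t}} for t in sample]}}}


def jaccard(a, b):
    """Jaccard similarity |A n B| / |A u B| between two collections of tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty signatures count as identical
    return len(a & b) / len(a | b)


# Usage: retrieve candidates with the query, then score each one client-side.
source_sig = {"h1", "h2", "h3", "h4"}
candidate_sig = {"h2", "h3", "h4", "h5"}
query = build_candidate_query(source_sig, sample_size=2)
score = jaccard(source_sig, candidate_sig)  # 3 shared tokens out of 5 distinct
```

The cost I mention comes from the second step: every candidate returned by the AND query needs its own signature fetched and compared.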

Are there other, more efficient strategies you'd recommend for finding near duplicates?

My next thought is to randomly select strings of five words and submit a couple of phrase queries...
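Something like the following is what I have in mind, as an untested sketch. It splits on whitespace, which is an assumption (the real analyzer may tokenize differently), and the field name `content` and the function name are placeholders.

```python
import random


def sample_phrase_queries(text, field="content", phrase_len=5, n_phrases=2, seed=None):
    """Pick random runs of `phrase_len` consecutive words and wrap each in match_phrase."""
    words = text.split()
    if len(words) < phrase_len:
        return []  # document too short to sample from
    rng = random.Random(seed)
    n_starts = len(words) - phrase_len + 1
    # Sample distinct start offsets so the same phrase is not picked twice.
    starts = rng.sample(range(n_starts), min(n_phrases, n_starts))
    return [
        {"match_phrase": {field: " ".join(words[s:s + phrase_len])}}
        for s in starts
    ]


text = "the quick brown fox jumps over the lazy dog near the quiet river bank"
clauses = sample_phrase_queries(text, seed=1)
# Require every sampled phrase to appear in a candidate document.
query = {"query": {"bool": {"filter": clauses}}}
```

Candidates that contain all of the sampled phrases would still need a similarity check afterwards, since sharing two phrases does not by itself make documents near duplicates.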

Thank you!

Mark_Harwood · July 15, 2020, 5:35pm

Hi Tim,
I was quite pleased with the encoding I used in the significant_text aggregation's near-duplicate filter.
It's used in-memory on result streams but I guess you could apply the same approach to indexing content signatures. I discuss it here.
