Near duplicate document detection

I have an index with ~12 million files. We have a serious duplication and near-duplication issue.

The exact-duplicate problem is trivial because we store a digest. The near-duplicate issue is more profound, and I've read several posts here that deal with near duplicates.

I reindexed with a min_hash filter using the default 512 buckets and other defaults, and I stored term_vectors for that field on the theory that it would improve search speed. Reindexing time and the extra space required were both problems there.
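For reference, here's roughly what my settings looked like, sketched as a Python dict of the request body (the analyzer, filter, and field names like `minhash_analyzer` and `fingerprint` are placeholders, not anything standard):

```python
# Sketch of index settings for a min_hash field. min_hash normally sits
# behind a shingle filter so hashes are computed over word n-grams.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "five_word_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 5,
                    "max_shingle_size": 5,
                    "output_unigrams": False,
                },
                "my_minhash": {
                    "type": "min_hash",
                    "bucket_count": 512,   # the default I used
                    "hash_count": 1,
                    "hash_set_size": 1,
                    "with_rotation": True,
                },
            },
            "analyzer": {
                "minhash_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["five_word_shingles", "my_minhash"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "fingerprint": {
                "type": "text",
                "analyzer": "minhash_analyzer",
                "term_vector": "yes",  # this is where the extra space went
            }
        }
    },
}
```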

When I run a MoreLikeThisQuery, performance is not great (200 ms up to 45 seconds per query), even if I limit the query to the top 5 terms. I saw similar performance running MLT against the plain content field (with no term vectors).
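The query I was running looked roughly like this (again a sketch; the index and field names are placeholders for my setup):

```python
# A more_like_this query limited to the top 5 terms, expressed as a dict.
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["fingerprint"],                      # the min_hash field
            "like": [{"_index": "docs", "_id": "doc-123"}], # seed document
            "max_query_terms": 5,                           # "top 5" limit
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
}
```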

I got much better performance when I programmatically built my own Boolean AND query from a subset of the terms stored in the term vector of the min_hash field, but then I had to compute Jaccard similarity on the matches, which can be expensive given the number of near duplicates we have.
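In case it's useful, here's the shape of that approach in plain Python: build an AND query from a random subset of min_hash terms, then verify candidates with Jaccard (`fingerprint` and the `k=4` subset size are just illustrative choices):

```python
import random

def jaccard(a, b):
    """Jaccard similarity of two term sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def and_query_from_terms(terms, k=4, seed=0):
    """Build a bool/must (AND) query body from a random subset of k terms."""
    rng = random.Random(seed)
    subset = rng.sample(sorted(terms), min(k, len(terms)))
    return {
        "query": {
            "bool": {"must": [{"term": {"fingerprint": t}} for t in subset]}
        }
    }

# Candidates returned by the AND query still need a Jaccard check:
doc_a = {"h1", "h2", "h3", "h4"}
doc_b = {"h1", "h2", "h3", "h9"}
print(jaccard(doc_a, doc_b))  # 3 shared / 5 total = 0.6
```

The AND query is cheap and selective, but because it only samples a few hashes, the Jaccard pass over every candidate is what dominates when there are many near duplicates.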

Are there other, more efficient strategies you'd recommend for finding near duplicates?

My next thought is to randomly select strings of five words and submit a couple of phrase queries...
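Something like this is what I have in mind: pick random runs of five consecutive words from a document and wrap each as a phrase query (the `content` field name and counts are just examples):

```python
import random

def sample_phrase_queries(text, n_phrases=2, phrase_len=5, seed=42):
    """Pick random 5-word runs from text and build match_phrase query clauses."""
    words = text.split()
    if len(words) < phrase_len:
        return []
    max_starts = len(words) - phrase_len + 1
    rng = random.Random(seed)
    starts = rng.sample(range(max_starts), min(n_phrases, max_starts))
    return [
        {"match_phrase": {"content": " ".join(words[s:s + phrase_len])}}
        for s in starts
    ]

queries = sample_phrase_queries(
    "the quick brown fox jumps over the lazy dog tonight"
)
```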

Thank you!


Hi Tim,
I was quite pleased with the encoding I used in the significant_text aggregation's near-duplicate filter.
It's used in-memory on result streams but I guess you could apply the same approach to indexing content signatures. I discuss it here.
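For anyone landing here: the near-duplicate filtering in significant_text can be switched on per-aggregation. A minimal request body, sketched as a Python dict (the `content` field and query are placeholders):

```python
# significant_text with its built-in duplicate-text filter enabled.
agg_request = {
    "query": {"match": {"content": "elasticsearch"}},
    "aggs": {
        "keywords": {
            "significant_text": {
                "field": "content",
                "filter_duplicate_text": True,  # the near-duplicate filter
            }
        }
    },
}
```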


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.