I have an index with ~12 million files. We have a serious problem with both exact duplicates and near duplicates.
The exact duplicate problem is trivial because we store a digest. The near duplicate problem is harder, and I've read several of the posts on discuss.elastic.co that deal with near duplicates.
I reindexed with a min_hash filter with the default 512 buckets and other defaults, and I stored term_vectors for that field on the theory that this would improve search speed. Reindexing time and the extra space required were both impressive... no problems there.
When I run a MoreLikeThisQuery, the performance is not great (200 ms up to 45 seconds per query), even if I limit the terms to the top 5. I saw similar performance on MLT against the plain content field (with no term vectors).
I got much better performance when I programmatically built my own Boolean AND query from a subset of the terms stored in the term vector for the min hash field, but then I had to compute Jaccard similarity on the matches, which can be expensive given the number of near duplicates we have.
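
To make the Jaccard step concrete, here is a minimal sketch. It assumes the min hash tokens for the query document and for each matched document have already been pulled out of the term vectors into plain Python collections. The function names, the `candidates` mapping and the 0.8 threshold are illustrative assumptions, not something Elasticsearch provides.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty sets are treated as identical (a convention, not a rule).
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(query_tokens, candidates, threshold=0.8):
    """Return (doc_id, score) pairs at or above `threshold`, best first.

    `candidates` is a hypothetical mapping of document id -> min hash tokens
    for the documents returned by the Boolean AND query.
    """
    scored = (
        (doc_id, jaccard(query_tokens, tokens))
        for doc_id, tokens in candidates.items()
    )
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The cost is one set intersection and one union per matched document, so it grows linearly with the number of hits the AND query returns, which is where the expense comes from when a document has many near duplicates.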
Are there other, more efficient strategies you'd recommend for finding near duplicates?
My next thought is to randomly select strings of five words and submit a couple of phrase queries...
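
A rough sketch of that idea, assuming simple whitespace tokenization and a text field named `content`. The helper names, the shingle size of five and the choice to OR the phrases together with `minimum_should_match` are all assumptions for illustration. The phrases would still go through the field's analyzer at query time, so the whitespace split here only approximates the indexed tokens.

```python
import random


def random_shingles(text, size=5, count=2, seed=None):
    """Pick up to `count` distinct runs of `size` consecutive words from `text`."""
    words = text.split()
    if len(words) < size:
        # Text too short for a full shingle: fall back to the whole text.
        return [" ".join(words)] if words else []
    rng = random.Random(seed)
    max_start = len(words) - size
    starts = rng.sample(range(max_start + 1), k=min(count, max_start + 1))
    return [" ".join(words[s:s + size]) for s in sorted(starts)]


def phrase_query_body(text, field="content", size=5, count=2, seed=None):
    """Build a search body with one match_phrase clause per random shingle."""
    shingles = random_shingles(text, size=size, count=count, seed=seed)
    return {
        "query": {
            "bool": {
                "should": [{"match_phrase": {field: s}} for s in shingles],
                # Require at least one phrase to match; raise this to be stricter.
                "minimum_should_match": 1,
            }
        }
    }
```

The returned dictionary is only the request body; it would be sent with whatever client is already in use. Requiring every phrase to match trades recall for precision, since a near duplicate that was edited inside one of the sampled shingles would be missed.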