Finding documents _almost_ the same

cbuescher · November 15, 2016, 11:27am

Hi,

I think what you are describing is often referred to as the "Near Duplicate Detection" problem in the literature. I was involved in a project that had good results with shingling approaches similar to the one described in A. Broder, "Filtering near-duplicate documents".
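
To make the shingling idea concrete, here is a minimal sketch in Python. It is not the code from that project: the function names, the whitespace tokenizer and the shingle width `w=4` are my own assumptions. It computes the exact resemblance (shared shingles divided by all distinct shingles) between two texts. At scale you would compare small hashed sketches of the shingle sets rather than the full sets, which this sketch leaves out.

```python
def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) of text.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption made for illustration; a real analyzer would do more.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(text_a, text_b, w=4):
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(text_a, w), shingles(text_b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox leaps over the lazy dog"
    print(resemblance(doc1, doc2))
```

A single changed word touches up to `w` shingles, so a larger `w` makes the measure stricter about local edits.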

Having said that, what are you missing from "More Like This"? I'd imagine that fetching the top N MLT documents and then computing a simple set similarity (e.g. on the term vectors of specific fields) will get you some of the way.
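
A rough sketch of that second step, in Python. It assumes you have already fetched the candidates with an MLT query and extracted the terms of the relevant field for each one. The `candidates` dict, the function names and the `0.8` threshold are hypothetical and only there for illustration; nothing here calls Elasticsearch.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity of two term collections, treated as sets."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        return 1.0  # both empty: treat as identical
    return len(a & b) / len(union)


def near_duplicates(source_terms, candidates, threshold=0.8):
    """Keep candidates whose similarity to the source meets the threshold.

    candidates maps a document id to that document's terms (hypothetical
    shape; fill it from the top N MLT hits). Returns (doc_id, score)
    pairs sorted by descending score.
    """
    scored = [
        (doc_id, jaccard(source_terms, terms))
        for doc_id, terms in candidates.items()
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    source = ["quick", "brown", "fox", "jumps"]
    candidates = {
        "a": ["quick", "brown", "fox", "jumps"],
        "b": ["quick", "brown", "fox", "leaps"],
        "c": ["lazy", "dog"],
    }
    for doc_id, score in near_duplicates(source, candidates, threshold=0.5):
        print(doc_id, score)
```

MLT does the cheap candidate selection and the set similarity only has to run over N documents, so the threshold can be tuned without touching the query.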
