Prevent duplicate content in search

Yurii · July 4, 2018, 10:30am

Hey! I ran into a problem when trying to get related posts. There are a few posts that are identical or very similar to each other.
So when I fetch the related posts, in some cases the results contain several duplicated posts with identical content.
How can I solve this problem?

cbuescher · July 7, 2018, 9:06am

There are a few options here:

Try to detect identical posts at index time and discard the exact duplicates. For this, people usually compute some sort of identity fingerprint of the content (an MD5 hash or similar) and compare it against the hashes already seen, kept in a fast in-memory data structure. If that is not feasible, you could store the hash in Elasticsearch itself, but then you would need an additional query to check whether the hash/document is already present.
You could do the same as above but with a batch job: index the fingerprint value, periodically scan for duplicates, and discard the extra ones.
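
A minimal sketch of the fingerprinting idea, in Python. The `fingerprint` and `drop_exact_duplicates` helpers, the `content` field name, and the lowercase/whitespace normalization are all assumptions made for illustration; only the MD5-plus-in-memory-lookup approach comes from the reply above.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return an MD5 hex digest of the normalized content.

    Normalization (lowercasing, collapsing whitespace) is an assumption;
    adjust it to whatever "identical" means for your posts.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_exact_duplicates(posts):
    """Keep the first post for each fingerprint, skip later copies."""
    seen = set()  # in-memory set of fingerprints already encountered
    unique = []
    for post in posts:
        fp = fingerprint(post["content"])
        if fp in seen:
            continue  # exact duplicate: do not index it
        seen.add(fp)
        unique.append(post)
    return unique


# Hypothetical sample data
posts = [
    {"id": 1, "content": "Hello  world"},
    {"id": 2, "content": "hello world"},
    {"id": 3, "content": "Something else"},
]
unique_posts = drop_exact_duplicates(posts)
```

For the batch variant, the same `fingerprint` value could be stored as a field on each document so that duplicates can be grouped and removed later.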

Near-duplicates are a harder problem. There is a whole research area dedicated to finding them, and what counts as "very similar" is very domain specific. You might get away with simple fuzzy searches, but usually spotting near-duplicates involves more complicated methods, mostly shingling algorithms. In a project I once did we implemented an algorithm by Broder, but there might be other approaches. No silver bullets here as far as I know...
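
To give a rough idea of what shingling looks like, here is a small Python sketch that compares word-level shingle sets using the Jaccard coefficient. It is not necessarily the algorithm from the project mentioned above; the function names, the shingle size `w=3`, and the simple whitespace tokenization are assumptions.

```python
def shingles(text: str, w: int = 3) -> set:
    """Return the set of contiguous w-word sequences in the text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a: str, b: str, w: int = 3) -> float:
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample documents
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
doc_c = "completely unrelated text about search engines"

score_ab = resemblance(doc_a, doc_b)  # 6 shared shingles out of 8 -> 0.75
score_ac = resemblance(doc_a, doc_c)  # no shared shingles -> 0.0
```

Posts whose score exceeds some threshold would be treated as near-duplicates; the right threshold depends on the domain. Comparing every pair this way does not scale to large indices, which is where sketching techniques such as MinHash come in.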

system · August 4, 2018, 9:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
