Prevent duplicate content in search

Hey! I ran into a problem when trying to get related posts. Some posts are identical or very similar to each other, so when I query for related posts, the result sometimes contains several posts whose content duplicates one another.
How can I solve this problem?

There are a few options here:

  • Try to detect identical posts at index time and discard the exact duplicates. For this, people usually compute some sort of identity fingerprint (e.g. an MD5 hash) of the content and compare it against already-seen hashes in a fast in-memory data structure. If that is not feasible, you could store the hash in Elasticsearch itself, but then you would have to run an additional query to check whether the hash/document is already present.
  • Alternatively, do the same as above but with a batch job: index the fingerprint value along with each document, then periodically scan for duplicates and discard the extra ones.
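The index-time fingerprinting idea above can be sketched roughly like this (a minimal in-memory version; the `fingerprint`/`dedupe` names and the light normalization step are my own assumptions, not anything Elasticsearch provides out of the box):

```python
import hashlib

def fingerprint(content: str) -> str:
    # Normalize lightly so trivial whitespace/case differences
    # don't defeat the exact-duplicate check.
    normalized = " ".join(content.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def dedupe(posts):
    """Keep the first occurrence of each content fingerprint, drop the rest."""
    seen = set()  # fast in-memory structure of already-seen hashes
    unique = []
    for post in posts:
        h = fingerprint(post["content"])
        if h not in seen:
            seen.add(h)
            unique.append(post)
    return unique
```

In a real pipeline you would call `dedupe` (or the membership check inside it) just before sending documents to the bulk indexer; storing the hash as a field on each document also gives the batch-job variant something to scan for later.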

Near-duplicates are a harder problem. There is a whole research area dedicated to finding them, and what counts as "very similar" is also very domain-specific. You might get away with simple fuzzy searches, but spotting near-duplicates usually involves more sophisticated methods, mostly shingling algorithms. In a project I once worked on we implemented an algorithm by Broder, but there are other approaches. No silver bullets here, as far as I know...
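To make the shingling idea concrete, here is a minimal sketch of the core of that family of methods: break each text into overlapping word shingles and compare the sets with Jaccard similarity. Broder's actual algorithm adds MinHash sketches on top so you can compare documents at scale without full pairwise set intersections; the function names and the `k`/`threshold` values below are illustrative assumptions, not values from any particular paper.

```python
def shingles(text: str, k: int = 4) -> set:
    # k-word shingles: every run of k consecutive words in the text.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |intersection| / |union| of the shingle sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate(text1: str, text2: str, threshold: float = 0.8) -> bool:
    # Two texts are "near-duplicates" if their shingle sets overlap heavily.
    # The threshold is domain-specific and needs tuning on real data.
    return jaccard(shingles(text1), shingles(text2)) >= threshold
```

For a handful of candidate posts (e.g. the top N related-search hits) this brute-force comparison is perfectly fine; MinHash/LSH only becomes necessary when you need to deduplicate the whole corpus.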

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.