Duplicate documents detection in Elasticsearch

trungdk · December 26, 2015, 4:48pm

Hi all,

I'm going to build a system that can detect duplicate documents (10,000-20,000 words each) for a small library project (around 1,200-2,000 documents). It's much like the SEO plagiarism tools, but much bigger.
I think I will go with Elasticsearch, using MLT (More Like This) or the Shingle Token Filter. How well does it perform, and is it suitable for my project?

I look forward to your advice,
Thank you all.

Ivan · December 26, 2015, 10:49pm

Take a look at this plugin, which offers more algorithms useful for data
deduplication: https://github.com/YannBrrd/elasticsearch-entity-resolution

Ivan

trungdk · December 27, 2015, 12:37am

Thank you so much Ivan,
I did take a look at this plugin, but it looks like it only supports small requests. What I am looking for is this: for example, I want to scan an entire book (maybe 20 to 30 pages) against my entire library (1,200 to 2,000 books) to see what percentage of it is the same (just like SEO plagiarism tools, but with local sources).
Is it possible with this plugin?

jprante · December 27, 2015, 1:14am

Yes, you can detect plagiarism on the client side. Use a fingerprinting method and More Like This to detect various degrees of document similarity.
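
One way this could look, as a rough sketch only: use a More Like This query to pull a handful of candidate books from the index, then compare word-shingle fingerprints on the client to estimate what fraction of the scanned book also appears in each candidate. The field name `content`, the shingle size `k`, and the query thresholds below are assumptions for illustration, not values from this thread, and the More Like This parameter names can differ between Elasticsearch versions.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of hashed k-word shingles (a simple document fingerprint)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return set()
    return {
        hashlib.sha256(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()[:16]
        for i in range(len(words) - k + 1)
    }


def containment(query_fp, source_fp):
    """Fraction of the query document's shingles that also occur in the source."""
    if not query_fp:
        return 0.0
    return len(query_fp & source_fp) / len(query_fp)


def mlt_query(text, field="content", size=20):
    """Build a More Like This request body to fetch candidate documents.

    The field name and all thresholds are assumptions; tune them for your data.
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": text,
                "min_term_freq": 1,
                "max_query_terms": 50,
                "minimum_should_match": "30%",
            }
        },
    }


if __name__ == "__main__":
    suspect = "the quick brown fox jumps over the lazy dog near the river bank"
    source = "yesterday the quick brown fox jumps over the lazy dog and ran away"
    score = containment(shingles(suspect, k=3), shingles(source, k=3))
    print(f"{score:.0%} of the suspect's shingles appear in the source")
```

The containment score is asymmetric on purpose: it answers "how much of this book is found in that one", which is closer to a plagiarism percentage than a symmetric similarity would be.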

You can easily detect exact duplicates by indexing a checksum, such as CRC-32 or Adler-32.
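
A minimal sketch of the checksum idea, assuming the document text is already available as a string on the client. The whitespace and case normalization step and the helper names are my own assumptions; the checksum value would be stored in its own field at index time so that duplicates can be found with a simple term lookup.

```python
import zlib


def checksum(text):
    """CRC-32 of the text after collapsing whitespace and lowercasing.

    The normalization is an assumption; drop it to match byte-identical files only.
    """
    normalized = " ".join(text.split()).lower()
    return zlib.crc32(normalized.encode("utf-8")) & 0xFFFFFFFF


def find_exact_duplicates(docs):
    """Group document ids that share a checksum.

    docs maps a document id to its text. Only groups with more than one
    id are returned.
    """
    by_sum = {}
    for doc_id, text in docs.items():
        by_sum.setdefault(checksum(text), []).append(doc_id)
    return [ids for ids in by_sum.values() if len(ids) > 1]


if __name__ == "__main__":
    library = {
        "book-1": "Hello   world",
        "book-2": "hello world",
        "book-3": "Something else entirely",
    }
    print(find_exact_duplicates(library))
```

A 32-bit checksum can collide, so a matching checksum is best treated as a candidate and confirmed by comparing the actual text. It also only catches identical documents; partial overlap needs a fingerprinting approach.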

There is no plugin I know of, and the reason is obvious: the scenarios and requirements are too difficult for a generic solution. You have to program the detection yourself.

See Plagiarism detection
