Duplicate document detection in Elasticsearch

(Đinh Khắc Trung) #1

Hi all,

I'm going to build a system that can detect duplicate documents (10,000-20,000 words per document) for a small library project (around 1,200-2,000 documents). It's much like the SEO plagiarism tools, but much bigger.
I think I will go with Elasticsearch, using MLT (More Like This) or the Shingle Token Filter. How well does it perform, and is it suitable for my project? :frowning:

Looking forward to your advice,
Thank you all.

(Ivan Brusic) #2

Take a look at this plugin, which offers more algorithms useful for data
deduplication: https://github.com/YannBrrd/elasticsearch-entity-resolution


(Đinh Khắc Trung) #3

Thank you so much Ivan,
I did take a look at this plugin, but it looks like it only supports small requests. What I am looking for is, for example, to scan an entire book (maybe 20 to 30 pages) against my entire library (1,200 to 2,000 books) to see what percentage of it is the same (just like SEO plagiarism checking, but with local sources).
Is that possible with this plugin?

(Jörg Prante) #4

Yes, you can detect plagiarism on the client side. Use a fingerprinting method and More Like This to detect various degrees of document similarity.
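
To make the fingerprinting idea concrete, here is a minimal client-side sketch, assuming word shingles as the fingerprint. The function names (`shingles`, `jaccard`, `containment`) and the shingle size `k` are my own illustrative choices, not part of Elasticsearch or any plugin. Containment gives the "what percentage of this book also appears in that one" number asked about above.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Symmetric similarity: shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, source):
    """Fraction of the query's shingles that also occur in the source."""
    return len(query & source) / len(query) if query else 0.0


if __name__ == "__main__":
    book = shingles("the quick brown fox jumps over the lazy dog", k=3)
    suspect = shingles("the quick brown fox sleeps", k=3)
    print("jaccard:", jaccard(book, suspect))
    print("containment:", containment(suspect, book))
```

For a library of a couple of thousand books, the shingle sets can be precomputed once and compared pairwise on the client. A More Like This query can narrow the candidate list first if the pairwise comparison becomes too slow.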

You can easily detect exact duplicates just by indexing a checksum, such as CRC-32 or Adler-32.
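
A rough sketch of the checksum approach, using Python's standard `zlib` module. The whitespace and case normalization and the helper names (`normalize`, `checksums`, `find_exact_duplicates`) are assumptions for illustration. Since a 32-bit checksum can collide, the sketch confirms a match by comparing the normalized text as well.

```python
import zlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def checksums(text):
    """Compute CRC-32 and Adler-32 of the normalized text as unsigned ints."""
    data = normalize(text).encode("utf-8")
    return {
        "crc32": zlib.crc32(data) & 0xFFFFFFFF,
        "adler32": zlib.adler32(data) & 0xFFFFFFFF,
    }


def find_exact_duplicates(docs):
    """Return (duplicate_id, original_id) pairs for a dict of id -> text."""
    seen = {}  # crc32 value -> list of (doc_id, normalized text)
    duplicates = []
    for doc_id, text in docs.items():
        norm = normalize(text)
        key = checksums(text)["crc32"]
        for other_id, other_norm in seen.get(key, []):
            # Same checksum is only a hint; compare the text to rule out collisions.
            if other_norm == norm:
                duplicates.append((doc_id, other_id))
                break
        else:
            seen.setdefault(key, []).append((doc_id, norm))
    return duplicates


if __name__ == "__main__":
    library = {"a": "Hello  World", "b": "hello world", "c": "something else"}
    print(find_exact_duplicates(library))
```

In an index, the checksum would be stored as an extra field on each document, so an exact-duplicate check becomes a simple term lookup on that field.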

There is no plugin that I know of, and the reason is obvious: the scenarios and requirements are too difficult for a generic solution. You have to program the detection yourself.
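
As one possible building block for that client-side code, here is a sketch that builds a More Like This request body as a plain Python dict. The field name `content` and the tuning values are hypothetical, and the `like` parameter assumes a reasonably recent Elasticsearch version (older releases used `like_text`). The sketch only builds the JSON; sending it with a client is left out.

```python
import json


def mlt_query(text, field="content", max_query_terms=50, min_term_freq=1,
              minimum_should_match="30%", size=10):
    """Build a search body that asks for documents similar to the given text.

    `field` is a hypothetical name for the indexed full-text field.
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": text,
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
                "minimum_should_match": minimum_should_match,
            }
        },
    }


if __name__ == "__main__":
    body = mlt_query("a chapter or passage taken from the book being checked")
    print(json.dumps(body, indent=2))
```

The hits returned by such a query are only candidates. The actual percentage of overlap still has to be computed on the client, for example by comparing fingerprints of the submitted book with each candidate.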

See Plagiarism detection
