I'm going to build a system that detects duplicate documents (10,000-20,000 words each) for a small library project (about 1,200-2,000 documents). It is much like the SEO plagiarism tools, but much bigger.
I am thinking of going with Elasticsearch, using either More Like This (MLT) or the Shingle Token Filter. How well does it perform, and is it suitable for my project?
Thank you so much Ivan,
I did take a look at this plugin, but it looks like it only supports small requests. What I am looking for is, for example, to scan an entire book (maybe 20 to 30 pages) against my entire library (1,200 to 2,000 books) and see what percentage of it is the same (just like SEO plagiarism checking, but with local sources).
Is that possible with this plugin?
Yes, you can detect plagiarism on the client side. Use a fingerprinting method together with More Like This to detect various degrees of document similarity.
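To make the fingerprinting idea concrete, here is a minimal client-side sketch in Python. It assumes word-level shingles of five words and reports the share of the scanned book's shingles that also occur in one library document; the function names and the shingle size are my own choices for illustration, not anything Elasticsearch provides.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < size:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(query_text, source_text, size=5):
    """Fraction of the query's shingles that also appear in the source."""
    query_shingles = shingles(query_text, size)
    if not query_shingles:
        return 0.0
    source_shingles = shingles(source_text, size)
    return len(query_shingles & source_shingles) / len(query_shingles)
```

Calling `containment(book_text, library_text)` for each library document gives a rough "percent the same" figure between 0.0 and 1.0. For 1,200 to 2,000 documents you would probably precompute and store the shingle sets (or hashes of them) instead of recomputing them on every scan.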
Exact duplicates are easy to detect by indexing a checksum of each document, such as CRC-32 or Adler-32.
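A small sketch of the checksum approach, using CRC-32 from Python's standard `zlib` module. The whitespace and case normalization is an assumption on my part, so that trivially reformatted copies still match; the helper names are hypothetical.

```python
import zlib


def checksum(text):
    """CRC-32 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return zlib.crc32(normalized.encode("utf-8"))


def find_exact_duplicates(docs):
    """Group document ids by checksum; return groups with more than one id.

    docs maps a document id to its full text.
    """
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(checksum(text), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

In Elasticsearch you would store the checksum as a field on each document and look it up with an exact-match query. Because a 32-bit checksum can collide, it is worth comparing the actual text of any matches before declaring them duplicates.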
There is no plugin that I know of, and the reason is obvious: the scenarios and requirements are too difficult for a generic solution. You have to program the detection yourself.
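As a starting point for your own detection code, one option is to use More Like This only to fetch a handful of candidate documents, and then do the precise comparison on the client. The sketch below just builds the request body as a Python dict. The field name `body`, the result size and the parameter values are assumptions for illustration, and the parameter names follow the current query DSL (older releases used `like_text` instead of `like`), so check them against the documentation for your version.

```python
def build_mlt_query(text, field="body", max_query_terms=50, size=10):
    """Build a More Like This search body for candidate retrieval.

    field is a hypothetical name for the indexed full-text field.
    """
    return {
        "size": size,  # number of candidate documents to return
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": text,
                "min_term_freq": 1,
                "max_query_terms": max_query_terms,
            }
        },
    }
```

For a whole book it is likely better to send one query per chapter or per page rather than the entire text at once, since MLT only picks a limited number of terms from the input.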