I'm going to build a system that detects duplicate documents (10,000-20,000 words each) for a small library project (about 1,200-2,000 documents). It's much like the SEO plagiarism tools, but much bigger.
I'm thinking of going with Elasticsearch, using either the More Like This (MLT) query or the Shingle Token Filter. How well does it perform, and is it suitable for my project?
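To make clear what I mean by shingle-based comparison, here is a minimal sketch in plain Python. It is not Elasticsearch code and does not reflect how Elasticsearch scores documents internally; it only illustrates the general idea of splitting text into overlapping word n-grams (shingles) and comparing the sets with Jaccard similarity. The function names, the shingle size `k=3`, and the `0.5` threshold are all assumptions chosen for illustration.

```python
import re
from itertools import combinations


def word_shingles(text, k=3):
    """Return the set of overlapping k-word shingles in `text` (k is an assumed default)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|, or 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_near_duplicates(docs, k=3, threshold=0.5):
    """Compare every pair of documents and return (id, id, score) above the threshold.

    `docs` maps a document id to its text. The threshold is an illustrative
    assumption, not a recommended value.
    """
    shingled = {doc_id: word_shingles(text, k) for doc_id, text in docs.items()}
    pairs = []
    for left, right in combinations(sorted(shingled), 2):
        score = jaccard(shingled[left], shingled[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs


if __name__ == "__main__":
    sample = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "completely unrelated text about search engines and indexing",
    }
    print(find_near_duplicates(sample))
```

With 2,000 documents this brute-force approach needs roughly two million pairwise comparisons, which is part of why I am asking whether an index-based tool would be a better fit.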
I look forward to your advice.
Thank you all.