I am researching how to implement a plagiarism detector using ES:
- we have a set of documents that are stored in ES;
- for a new document, we need to check whether it contains phrases that are present in other existing documents, which would mean that those phrases were borrowed.
Is it possible to implement this using any of the standard ES mechanisms?
I have tried playing with shingles and MoreLikeThis, but that does not seem to be the right approach.
The only solution that comes to mind is to extract shingles of length k (if we need to find borrowed phrases of length >= k) and run a match_phrase query for each shingle in a loop. But that does not seem very efficient.
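
To make the loop idea concrete, here is a rough sketch of what I have in mind. It only builds the query bodies and does not talk to ES. The helper names (word_shingles, build_phrase_queries) and the field name "content" are made up for illustration, and the whitespace split is a simplification that will not match what a real ES analyzer produces.

```python
from typing import Dict, List


def word_shingles(text: str, k: int) -> List[str]:
    """Return each run of k consecutive words, in order, without duplicates."""
    # Simplification: lowercase + whitespace split, not a real ES analyzer.
    words = text.lower().split()
    seen = set()
    shingles = []
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        if shingle not in seen:
            seen.add(shingle)
            shingles.append(shingle)
    return shingles


def build_phrase_queries(text: str, k: int, field: str = "content") -> List[Dict]:
    """Build one match_phrase search body per shingle.

    "content" is an assumed field name; replace it with the real mapping.
    """
    return [
        {"query": {"match_phrase": {field: shingle}}}
        for shingle in word_shingles(text, k)
    ]


if __name__ == "__main__":
    for body in build_phrase_queries("The quick brown fox jumps", 3):
        print(body)
```

Each body would then be sent as a separate search request, so a document with n words needs up to n - k + 1 requests, which is why this looks inefficient to me.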