Plagiarism detection


(Dmitry) #1

Hi,

I am researching how to implement a plagiarism detector using ES:

  1. we have a set of documents that are stored in ES;
  2. for a new document, we need to check whether it contains phrases that are present in some other existing documents - that would mean those phrases were borrowed.

Is it possible to implement this using any of the standard ES mechanisms?
I have tried playing with shingles and MoreLikeThis, but it does not seem to be the right approach.

The only solution that comes to my mind is to extract shingles of length k (if we need to find borrowed phrases of length >= k) and perform a match_phrase query in a loop. But that does not seem very efficient.


(Jörg Prante) #2

You have to turn several words of your text into a bit pattern and encode this as a signature. Index the signatures in Elasticsearch.

For other texts, you repeat the process, and then you can count the number of matching signatures, instead of comparing a text word by word (or shingle by shingle).

The higher the matching count, the more likely plagiarism is.
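A minimal sketch of this signature idea in Python (the 4-word group size, the CRC32 checksum, and the function names are my own assumptions for illustration; in Elasticsearch the signatures would be indexed as a field and the counting delegated to a query):

```python
import zlib

def signatures(text, group_size=4):
    """Split a text into fixed-size word groups and checksum each group.

    Each checksum acts as a compact "signature" of a few words.
    """
    words = text.lower().split()
    groups = [words[i:i + group_size] for i in range(0, len(words), group_size)]
    return {zlib.crc32(" ".join(g).encode("utf-8")) for g in groups}

def matching_count(text_a, text_b):
    """Count signatures shared by two texts, instead of comparing word by word."""
    return len(signatures(text_a) & signatures(text_b))
```

The more signatures two texts share, the more likely one borrows from the other.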


(Dmitry) #3

Thanks a lot for the help.

Unfortunately, I don't have a lot of experience with such search strategies.

As I understand it, you suggest the following:

  1. Say we have the text "aaa bbbbbb cc ddd eeeee ffff";
  2. We split the text into groups (for example, 2-word groups): "aaa bbbbbb", "cc ddd", "eeeee ffff";
  3. We create a hash of each group: "hash1", "hash2", "hash3";
  4. We store these hashes as an array field: "hashes": ["hash1", "hash2", "hash3"];
  5. But then how do we find the intersection of hashes among documents?
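The steps above can be sketched in plain Python (CRC32 stands in for whatever hash function is chosen, and the document IDs are made up):

```python
import zlib

def hash_groups(text, group_size=2):
    """Steps 2-3: split into fixed-size word groups, checksum each group."""
    words = text.split()
    groups = [" ".join(words[i:i + group_size])
              for i in range(0, len(words) - group_size + 1, group_size)]
    return {zlib.crc32(g.encode("utf-8")) for g in groups}

# Step 4: each indexed document stores its hashes in a multi-valued field.
index = {
    "doc1": hash_groups("aaa bbbbbb cc ddd eeeee ffff"),
    "doc2": hash_groups("zz yy xx ww vv uu"),
}

def best_matches(new_text):
    """Step 5: rank stored documents by the size of the hash intersection.

    In Elasticsearch this counting is what term matching does for you:
    a query listing the new document's hashes scores each stored document
    by how many hash terms they share.
    """
    new_hashes = hash_groups(new_text)
    return sorted(((len(new_hashes & h), doc_id) for doc_id, h in index.items()),
                  reverse=True)
```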

(Jörg Prante) #4

In reality, you will extract only the key words from a text and ignore stopwords.

Set a granularity, i.e. a maximum number of features (key word sets) per text. Around 100 or 200 features will suffice in most cases (unless you have very long texts with hundreds of pages).

You should not use cryptographic hashes for the signatures; they are overkill. Use CRC or another simple checksum algorithm like Adler-32, which is faster. There are collisions, but they do not harm the detection unless you rely on only a small feature count.

For the analysis, run a more like this query over the signatures, so the query returns the texts with the most common signatures at the top.

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
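As a sketch of this pipeline: stopword removal, Adler-32 checksums from Python's standard zlib, a feature cap, and the rough shape a more_like_this query over those signatures might take. The stopword list, feature size, and the `signatures` field name are my own placeholders, not anything ES prescribes:

```python
import zlib

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list

def feature_signatures(text, feature_size=5, max_features=200):
    """Checksum key-word sets: drop stopwords, group the remaining words,
    and Adler-32 each group. max_features caps the granularity."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    feats = [" ".join(words[i:i + feature_size])
             for i in range(0, len(words), feature_size)]
    return ["%08x" % zlib.adler32(f.encode("utf-8")) for f in feats[:max_features]]

def mlt_query(new_text):
    """A more_like_this query body over a 'signatures' field (name assumed)."""
    return {
        "query": {
            "more_like_this": {
                "fields": ["signatures"],
                "like": " ".join(feature_signatures(new_text)),
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }
```

Documents sharing many signature terms with the new text would then score highest.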


(Dmitry) #5

Jörg,

Thank you a lot for the good idea.

I have some concerns:

  1. It seems that if I use a comparison based on the common words of two texts, it would find many "similar" texts that merely share a topic, rather than actual plagiarism cases. It is probably better to use, for example, 10-word shingles, calculate a checksum of each shingle, and then use "more like this". Do you see any limitations?

  2. I don't have any experience with Elasticsearch. Could you please explain how to store these checksums? As an array field?
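The 10-word-shingle idea from point 1 can be sketched like this (overlapping shingles, one checksum per shingle; the sizes and names are illustrative only):

```python
import zlib

def shingle_checksums(text, k=10):
    """Overlapping k-word shingles, one CRC32 checksum per shingle.

    Overlapping by one word makes a borrowed phrase detectable at any
    offset inside the candidate text.
    """
    words = text.lower().split()
    return {zlib.crc32(" ".join(words[i:i + k]).encode("utf-8"))
            for i in range(len(words) - k + 1)}

def borrowed(candidate, source, k=10):
    """True if the candidate shares at least one k-word run with the source."""
    return bool(shingle_checksums(candidate, k) & shingle_checksums(source, k))
```

This addresses the topic-overlap concern: two texts that merely share vocabulary will not share a full 10-word run, while a verbatim borrowed sentence will.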


(Jörg Prante) #6

It totally depends on what your definition of plagiarism is.

To me, plagiarism is when you reuse a significant amount of equivalent key words, even if you reorder them, in a given context, say a paragraph (without adding a reference to the original author).

Other definitions emphasize preserving the same word order. In that case you have to keep the words in their order of appearance before checksumming them.

You are free to adjust the checksumming context. This is what I mean by granularity.

In the end, it depends on whether you accept false positives or are willing to overlook candidates in your detection.

An index can contain documents with paragraph/section IDs and a set of checksums per paragraph/section.

Checksums can be stored as hexadecimal strings, yes. There are no array fields in ES; I assume you mean multi-valued fields, which are enabled by default. This is exactly what you need to do. Note that with more like this, you cannot specify an ordered sequence of matching values, so you have to encode that information in the checksums.
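One way to see what encoding word order into the checksums means: checksum the words in their order of appearance versus checksum them as a sorted set. The first signature changes when the phrase is reordered; the second does not (a small illustration, not tied to any ES feature):

```python
import zlib

def ordered_signature(words):
    """Order-sensitive: a reordered phrase yields a different checksum."""
    return zlib.crc32(" ".join(words).encode("utf-8"))

def unordered_signature(words):
    """Order-insensitive: sort before checksumming, so only the word set matters."""
    return zlib.crc32(" ".join(sorted(words)).encode("utf-8"))
```

Which of the two to index follows from the chosen definition of plagiarism above.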

