How can I find duplicate documents that contain very long text fields?

For a more detailed discussion of the techniques, see: Plagiarism detection
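
As a rough illustration only (the linked discussion is not reproduced here, so the choice of techniques below is an assumption, not necessarily what it recommends): exact duplicates can usually be found by hashing a normalized copy of the long field and grouping documents by digest, so the full texts are never compared pairwise. Near-duplicates can be scored with Jaccard similarity over word shingles, a common fingerprinting idea in plagiarism detection. The function names, the whitespace/case normalization, and the shingle size `k=5` are all hypothetical choices for this sketch.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash a normalized copy of the text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_exact_duplicates(docs: dict) -> list:
    """Group document ids whose normalized text hashes to the same digest."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[fingerprint(text)].append(doc_id)
    # Keep only digests shared by more than one document.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


def shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 5) -> float:
    """Jaccard similarity of the two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    docs = {
        "a": "Hello   World, this is a long text field",
        "b": "hello world, this is a long text field",
        "c": "something else entirely",
    }
    print(find_exact_duplicates(docs))
    print(jaccard(docs["a"], docs["c"]))
```

Storing the digest as its own indexed field would let the datastore do the grouping. The pairwise Jaccard score does not scale to large collections on its own; MinHash with locality-sensitive hashing is the usual way to approximate it, and is not shown here.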