I have a huge dataset. Each document in the dataset contains several lines of short sentences.
My problem is: given a document, I need to search for similar documents based on a threshold on the percentage of short sentences that are the same. For example, if the threshold is 25%, two documents are considered similar when at least 25% of their short sentences are identical.
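To make the criterion concrete, here is a minimal Python sketch of the check I have in mind. The function names (`sentence_set`, `shared_fraction`, `is_similar`) are made up for illustration. Two things are assumptions on my part: sentences are compared by exact match after trimming whitespace, and the percentage is taken relative to the distinct sentences of the given (query) document.

```python
def sentence_set(document: str) -> set:
    """Return the distinct, non-empty lines of a document (one short sentence per line)."""
    return {line.strip() for line in document.splitlines() if line.strip()}


def shared_fraction(query: str, candidate: str) -> float:
    """Fraction of the query's distinct sentences that also appear in the candidate.

    Assumption: the percentage is measured against the query document.
    """
    query_sentences = sentence_set(query)
    if not query_sentences:
        return 0.0
    shared = query_sentences & sentence_set(candidate)
    return len(shared) / len(query_sentences)


def is_similar(query: str, candidate: str, threshold: float = 0.25) -> bool:
    """True when the shared fraction reaches the threshold (25% by default)."""
    return shared_fraction(query, candidate) >= threshold


if __name__ == "__main__":
    query_doc = "the cat sat\nit rained today\nsee you soon\nthanks a lot"
    other_doc = "it rained today\nhello world\ngood night"
    # One of the four query sentences is shared between the two documents.
    print(shared_fraction(query_doc, other_doc), is_similar(query_doc, other_doc))
```

This pairwise comparison only defines what "similar" means. Running it against every document in a huge dataset would be far too slow, which is why I am asking about indexing.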
My question is:
How should I index the documents, and what similarity algorithm should I use? Thanks in advance for any suggestions and feedback.