How to index short sentences for similarity search?

I have a huge dataset. Each document in the dataset contains several lines of short sentences.

My problem is: given a document, I need to search for similar documents based on a threshold for the percentage of short sentences that are identical. For example, if the threshold is 25%, two documents are considered similar when at least 25% of their short sentences are the same.
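To make the threshold concrete, here is a minimal Python sketch of the similarity rule described above. It assumes one sentence per line, that "same" means equal after trimming whitespace and lowercasing, and that the percentage is measured against the query document's sentences. These choices and the function names are illustrative assumptions, not part of the original question.

```python
def sentence_set(document):
    """Split a document into a set of normalized, non-empty lines."""
    return {line.strip().lower() for line in document.splitlines() if line.strip()}


def shared_fraction(query_doc, candidate_doc):
    """Fraction of the query document's sentences that also occur in the candidate."""
    query = sentence_set(query_doc)
    if not query:
        return 0.0
    return len(query & sentence_set(candidate_doc)) / len(query)


def is_similar(query_doc, candidate_doc, threshold=0.25):
    """Apply the threshold rule: similar if at least `threshold` of sentences match."""
    return shared_fraction(query_doc, candidate_doc) >= threshold
```

This brute-force comparison only defines the target behaviour. On a huge dataset it would have to be backed by an index so that candidates are not compared one by one.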

My question is:
How should I index the documents, and what similarity algorithm should I use? Thanks in advance for any suggestions and feedback.

If you need exact matches on these short sentences, you could store them in a keyword field and run a More Like This query against that field. Would that work for you?
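As a rough sketch of that suggestion, the Python below only builds the request bodies as dictionaries and does not call any client. The index name `docs`, the field name `sentences`, and the helper names are hypothetical. It assumes each sentence is stored as one value of a multi-valued keyword field and that the query document is passed as an artificial document in `like`. Mapping the 25% threshold onto `minimum_should_match` is an approximation that should be checked against the Elasticsearch documentation for your version.

```python
import json


def build_mapping(field="sentences"):
    """Index mapping with one keyword field holding each sentence as an exact value."""
    return {"mappings": {"properties": {field: {"type": "keyword"}}}}


def build_mlt_query(sentences, index="docs", field="sentences", threshold_percent=25):
    """More Like This query body asking for roughly threshold_percent shared sentences."""
    unique = sorted({s.strip() for s in sentences if s.strip()})
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                # Artificial document: the query document is not required to be indexed.
                "like": [{"_index": index, "doc": {field: unique}}],
                # Keep every sentence as a query term, however rare it is.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": len(unique),
                "minimum_should_match": f"{threshold_percent}%",
            }
        }
    }


if __name__ == "__main__":
    body = build_mlt_query(["the cat sat", "hello world", "the cat sat"])
    print(json.dumps(body, indent=2))
```

Raising `max_query_terms` to the number of sentences matters because More Like This otherwise selects only a limited number of terms, which would skew the percentage.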
