Hello, I was going through the "more_like_this" clause but was not able to find relevant documents. I have the data below in Elasticsearch, and the "description" field contains a large non-indexed body of text (over 1 million bytes). I have about ten thousand documents like the one below. How can I figure out which sets of documents match each other at least 80%?
{
  "_index": "school",
  "_type": "book",
  "_id": "1",
  "_source": {
    "title": "How to drive safely",
    "description": "The book is written to help readers about giving driving safety guidelines. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. LONG...."
  }
}
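For reference, this is roughly the shape of the "more_like_this" query I have been experimenting with, written here as a Python dict. The index and field names match my data above; the frequency thresholds and the 80% cutoff are just my guesses, not values I know to be right:

```python
# Sketch of the more_like_this query body I have been trying.
# Index/field names match my data; the numeric thresholds are guesses.
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["description"],
            # "like" can reference an existing document by index and id
            "like": [{"_index": "school", "_id": "1"}],
            "min_term_freq": 1,
            "min_doc_freq": 1,
            "minimum_should_match": "80%",
        }
    }
}
```

This dict would then be sent as the body of a search request (for example via the official Python client's `search` method), but I am not sure these settings give me "80% matching content" in the sense I described.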
In the end, I am looking for a list of document IDs with at least 80% matching content. A possible expected result, containing groups of matching document IDs (any format is fine):
[ [1,30, 500, 8000], [2, 40, 199], .... ]
Do I need to write a batch job that compares each document with every other document and builds the output set?
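If a pairwise batch turns out to be the only way, this is the kind of thing I have in mind: a minimal self-contained sketch (helper names and the similarity measure are my own choices, just for illustration) that shingles each description, compares all pairs with Jaccard similarity, and groups documents that are at least 80% similar into the `[[1, 30, 500], ...]` output shape:

```python
from itertools import combinations


def shingles(text, k=5):
    """Set of k-character shingles of the lowercased text."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(docs, threshold=0.8):
    """Group document IDs whose descriptions are >= threshold similar.

    docs: dict mapping doc_id -> description text.
    Returns groups (connected components of the similarity graph)
    with more than one member, e.g. [[1, 2], [5, 9, 11]].
    """
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    # Union-find so that chains of similar documents end up together.
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```

My worry is the cost: with ten thousand documents this is about 50 million pairs, each over >1 MB of text, so presumably something like MinHash/LSH would be needed instead of the brute-force loop. Is there a way to get Elasticsearch to do this grouping for me?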
Please guide.