I am using ES 5.6.2 to compare the paragraphs of two very similar documents, ORIG and CLONE (CLONE was cloned from ORIG and then heavily modified).
My solution has been as follows:
- I put the paragraphs of both documents in the same index, with a field "fromDocument" set to the name of the document the paragraph comes from.
- I then loop through all the paragraphs of ORIG and run an MLT (More Like This) search with each paragraph's content, to find the most similar paragraphs in the index.
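To make the setup concrete, here is a minimal sketch of the kind of MLT request body I send for each ORIG paragraph. The text field name "content", the result size, and the min_term_freq/min_doc_freq values are assumptions for illustration, not my exact settings; the function only builds the JSON body and does not talk to a cluster.

```python
import json


def build_mlt_query(paragraph_text, field="content", size=10):
    """Build a plain More Like This search body for one paragraph.

    `field` is a hypothetical name for the field holding the paragraph text.
    """
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": paragraph_text,
                # Lowered from the defaults because paragraphs are short
                # (assumed values, tune as needed).
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        },
    }


if __name__ == "__main__":
    body = build_mlt_query("Some paragraph taken from ORIG.")
    print(json.dumps(body, indent=2))
```

Nothing in this body restricts the hits by "fromDocument", so paragraphs from either document can come back.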
The problem with this approach is that the MLT search returns similar paragraphs from both ORIG and CLONE, and I only want paragraphs from CLONE.
I could of course do some post-processing to eliminate the paragraphs that come from ORIG, but I fear that in some cases all the top hits returned by MLT might come from ORIG. In that case, I would never get the similar paragraphs from CLONE that might have appeared past that point.
I was looking for a way to combine an MLT search with a boolean filter, but it seems the MLT query does not support a "filter" field.
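The shape I have in mind is roughly the following: wrap the MLT clause in a bool query and put a term filter on "fromDocument" next to it. This is only a sketch that I have not confirmed gives the ranking I want. It assumes "fromDocument" is mapped as a keyword (not analyzed) field so that an exact term match works, and "content" is again a hypothetical name for the paragraph text field.

```python
import json


def build_filtered_mlt_query(paragraph_text, document_name,
                             field="content", size=10):
    """Build a bool query: MLT in `must`, document restriction in `filter`.

    Assumes `fromDocument` is a keyword field; `field` is a hypothetical
    name for the field holding the paragraph text.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Scoring clause: similarity to the given paragraph.
                "must": [
                    {
                        "more_like_this": {
                            "fields": [field],
                            "like": paragraph_text,
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                        }
                    }
                ],
                # Non-scoring clause: keep only paragraphs from one document.
                "filter": [
                    {"term": {"fromDocument": document_name}}
                ],
            }
        },
    }


if __name__ == "__main__":
    body = build_filtered_mlt_query("Some paragraph taken from ORIG.", "CLONE")
    print(json.dumps(body, indent=2))
```

If this works as I hope, the filter would be applied to the candidate set rather than to the top hits after the fact, so CLONE paragraphs would not be crowded out by ORIG ones.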
Any thoughts on how to do this?