We're running into this problem and I'm musing about several solutions; please join me in that process.
We have 25+ million documents relating to videos, and I want to run More Like This based on a single document. The problem is that several documents point to the same video, but this is hard to detect because:
- the words in the title aren't exactly the same (but mostly contain the same words)
- the words in the description aren't exactly the same (but mostly contain the same words)
- the url is definitely not the same
So what I need is More Like This, but also a bit Different Than This.
There are several routes I was considering to eliminate this problem:
- Exclude the top X results/top X% of results (because they're so similar, it's hard to imagine they're not actually the same video)
- Calculate some kind of "hash value" over several fields (still no idea how but let's say it's possible) and then filter out results that are too close to that value.
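To make the "hash value" idea a bit more concrete, here is a minimal sketch of one possible approach, SimHash, in plain Python. This is only an illustration under assumptions: the function names, the `title` and `description` field names, the 64-bit fingerprint size, and the distance threshold of 3 are all made up for the example and would need tuning on real data. The idea is that texts sharing most of their words tend to get fingerprints that differ in only a few bits, so results whose fingerprint is within a small Hamming distance of the source document could be treated as the same video and dropped.

```python
import hashlib
import re


def tokens(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def simhash(text, bits=64):
    # Each token votes +1 or -1 on every bit position of the fingerprint.
    counts = [0] * bits
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the token
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    # Number of bit positions in which the two fingerprints differ.
    return bin(a ^ b).count("1")


def drop_near_duplicates(source, hits, max_distance=3):
    # Hypothetical post-filter: `source` and each hit are dicts with
    # "title" and "description" keys (assumed field names).
    def fp(doc):
        return simhash(doc["title"] + " " + doc["description"])

    source_fp = fp(source)
    return [h for h in hits if hamming(source_fp, fp(h)) > max_distance]
```

The fingerprint could be computed once at index time and stored in its own field, so that the filtering step only has to compare integers for the handful of hits that More Like This returns.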
My gut tells me Elasticsearch should be suitable for this problem, but I can't piece it together just yet. Suggestions are very welcome.