We're running into a problem and I'm mulling over several solutions; please join me in that process.
We have 25+ million documents relating to videos, and I want to run More Like This based on one document. The problem is that there are several documents pointing to the same video, but this is hard to detect because:
- the words in the title aren't exactly the same (but mostly contain the same words)
- the words in the description aren't exactly the same (but mostly contain the same words)
- the URL is definitely not the same
So what I need is More Like This, but actually also a bit Different Than This
There are several routes I've been considering to eliminate this problem:
- Exclude the top X results / top X% of results (because they're so similar, it's hard to imagine they're not actually the same video).
- Calculate some kind of "hash value" over several fields (still no idea how, but let's say it's possible) and then filter out results that are too close to that value.
My gut tells me Elasticsearch should be suitable for this problem, but I can't piece it together just yet. Suggestions are very welcome.
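For context, here's roughly the query I'm starting from; a minimal sketch with the Python client, where the index name "videos", the field names, and the document id are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Minimal More Like This query; "videos", the field names, and the
# document id stand in for what we actually have.
resp = es.search(
    index="videos",
    query={
        "more_like_this": {
            "fields": ["title", "description"],
            "like": [{"_index": "videos", "_id": "some-doc-id"}],
            "min_term_freq": 1,
            "min_doc_freq": 5,
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```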
I think what you are describing is often referred to as the "Near Duplicate Detection" problem in the literature. I was involved in a project that had good experiences with shingling approaches similar to the one described in A. Broder, "Identifying and Filtering Near-Duplicate Documents".
Having said that, which part of "More Like This" isn't working for you? I'd imagine getting the top N MLT documents and then computing some simple set similarity (e.g. on the term vectors of specific fields) will get you a good part of the way.
Impressive terminology, and it does indeed seem to be what I'm looking for; that alone is worth a lot, thanks. I'll do some research on Near Duplicate Detection. Could you maybe point me in the right direction for "computing some simple set similarity (e.g. on the term vectors of specific fields)"?
Also, are any of the methods for finding NDD already implemented in Elasticsearch, so that at index time I can already determine which documents look alike?
[update] It seems that simhash would be a good way to do this. I could compute the simhash when I index a document, and at search time use a filter that compares it to the simhash of the document at hand.
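A rough sketch of what I mean, in plain Python (not anything built into Elasticsearch; the tokenization and the 64-bit size are just assumptions for illustration):

```python
import hashlib
import re

def simhash(text: str, bits: int = 64) -> int:
    """Naive simhash: every token votes on every bit of the fingerprint."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; near-duplicates end up a few bits apart."""
    return bin(a ^ b).count("1")

# Two titles that (presumably) point to the same video:
a = simhash("Funny cat video compilation 2015")
b = simhash("Funny cat videos compilation (2015)")
print(hamming_distance(a, b))  # small distance -> likely the same video
```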
Unfortunately I'm not aware of any such off-the-shelf solution, but thinking in these directions seems very valuable to me. Most working implementations that I know of are somewhat domain-specific, so I think coming up with a general solution is difficult. I agree that providing better tooling to build this on top of would be great.
I was thinking of something simple, like fetching the term information for short fields such as the title for the top N results in your own application, e.g. via the Term Vectors API. Then you could compute something like the Jaccard index of those term sets and discard documents that are too similar to, say, the top-ranking one.
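A quick sketch of that post-filtering step, assuming the Python client; the index name, the "title" field, and the 0.8 threshold are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def title_terms(index: str, doc_id: str) -> set:
    """Set of analyzed terms in the title field, via the Term Vectors API."""
    resp = es.termvectors(index=index, id=doc_id, fields=["title"])
    return set(resp["term_vectors"].get("title", {}).get("terms", {}))

def jaccard(a: set, b: set) -> float:
    """Jaccard index of two term sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def drop_near_duplicates(index: str, hits: list, threshold: float = 0.8) -> list:
    """Keep only MLT hits whose title terms differ enough from the top hit."""
    top = title_terms(index, hits[0]["_id"])
    return [h for h in hits
            if jaccard(title_terms(index, h["_id"]), top) < threshold]
```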
What makes doing this at index time problematic is the cost of efficiently finding good candidate NDD matches for every new document. To do this efficiently, a small "fingerprint" of each document is usually computed (via shingling and hashing) that can then be used to limit the search scope for potential NDD matches.
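To make that a bit more concrete, here's one common variant of such a fingerprint (minhash signatures over word shingles, in the spirit of Broder's approach); the shingle width and the number of hashes are just illustrative:

```python
import hashlib

def shingles(text: str, w: int = 3) -> set:
    """w-word shingles: the unit of comparison in Broder-style NDD."""
    words = text.lower().split()
    return {" ".join(words[i:i + w])
            for i in range(max(len(words) - w + 1, 1))}

def minhash_signature(items: set, num_hashes: int = 16) -> list:
    """Minimum hash per seed; matching entries hint at high shingle overlap."""
    if not items:
        return []
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8],
                           "big")
            for s in items)
        for seed in range(num_hashes)
    ]

# Indexing each entry as a keyword ("0:12345", "1:67890", ...) lets a plain
# terms query retrieve NDD candidates cheaply before any exact comparison.
sig = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
fingerprint_terms = [f"{i}:{v}" for i, v in enumerate(sig)]
```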
It's an interesting problem that I'd like to give you more concrete advice on, but most of the solutions are somewhat application-specific. Maybe others chiming in on this thread have other ideas?