I have what seems to be a fairly common use case for MoreLikeThis: indexing
articles and running MLT queries to learn what other documents I have in the
index that are similar. I am a little confused about the results of some of
these queries and was hoping this list would be able to educate me on what's
happening under the hood.
One relevant detail about my application is that I'm caching the results of
the MLT queries so that I can do an aggregate query later on that says "give
me documents that are like any of this set of documents". So I have a table
that maps documents to similar documents and also saves the score that came
out of the MLT query. I have two questions about the similarity data that
I've generated so far:
1. Why would the score change over time? For a given document, it seems
that if I store the similar documents and corresponding scores, then after
time has passed and more documents have been added to the index, a fresh MLT
query for the same original document returns a different score than the
one I saved earlier for the same similar documents. If the text of neither
document changes, shouldn't the similarity score be calculated exactly the
same each time? Or is it possible that my first MLT query was run at a time
when the documents were not "completely" indexed (if such a concept exists)
and thus incomplete data was used in the query?
2. (Mostly to confirm my understanding) The score from an MLT result is not a
symmetric property, right? Document A might be similar to B with a score of
0.5, but document B might not even return A in its set of similar documents.
Indeed, I have some examples of this in my index where document B returns no
similar documents from an MLT query, even though B shows up in the MLT query
against A. In this case, why does the index return no documents similar to
B? There don't appear to be any query parameters [1] that would lower the
minimum threshold to return results.
There is not a lot of documentation for MoreLikeThis, even at the Lucene
layer, and the best I've found is Aaron Johnson's write-up [2]. Any help
you can provide will be greatly appreciated. Thanks in advance.
Why would the score change over time?
MLT scores take into account all the documents in the index. A term
that occurs in both documents raises the MLT score more when few other
documents contain it. So a word like "the" automatically carries the
least weight, since it occurs across all documents, whereas words like
"skirr", "epicaricacy" and "schizothemia" would really boost the score
between 2 documents, since they would likely only occur in a few
documents. As more documents that contain these words are added, the
influence of these terms drops, and so does the MLT score between the
2 documents, even though those 2 documents have not changed. As
Einstein said, "it's all relative, man".
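To make the effect concrete, here is a minimal Python sketch. It assumes the classic Lucene TF-IDF form of inverse document frequency, idf = 1 + ln(numDocs / (docFreq + 1)); the function name and the index sizes are made up for illustration, and a real MLT score also depends on term frequency and normalization, which are left out here.

```python
import math


def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency in the classic Lucene TF-IDF style.

    Assumption: idf = 1 + ln(numDocs / (docFreq + 1)). Other similarity
    implementations use different formulas.
    """
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical snapshot 1: 1,000 documents, the rare term occurs in 2 of them.
idf_rare_early = classic_idf(num_docs=1000, doc_freq=2)

# Hypothetical snapshot 2: the index grew to 5,000 documents and the
# once-rare term now occurs in 200 of them.
idf_rare_later = classic_idf(num_docs=5000, doc_freq=200)

# A term such as "the" that occurs in every document of the first snapshot.
idf_common = classic_idf(num_docs=1000, doc_freq=1000)

print(f"rare term, early index: {idf_rare_early:.3f}")
print(f"rare term, later index: {idf_rare_later:.3f}")
print(f"common term:            {idf_common:.3f}")
```

The weight of the rare term falls between the two snapshots even though neither of the two compared documents changed. Under the same formula the weight rises if the index grows with documents that do not contain the term, so a cached score can drift in either direction.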
On Sun, Sep 11, 2011 at 5:49 PM, Phil Whelan phil123@gmail.com wrote:
Hi Brandon,
Why would the score change over time?
Hi Phil,
This is exactly the type of response I had hoped for and it explains things
perfectly.