Questions about MoreLikeThis

Hello,

I have what seems to be a fairly common use case for MoreLikeThis: indexing
articles and running MLT queries to learn which other documents in the
index are similar. I am a little confused about the results of some of
these queries and was hoping this list could educate me on what's
happening under the hood.

One relevant detail about my application is that I'm caching the results of
the MLT queries so that I can do an aggregate query later on that says "give
me documents that are like any of this set of documents". So I have a table
that maps documents to similar documents and also saves the score that came
out of the MLT query. I have two questions about the similarity data that
I've generated so far:

  1. Why would the score change over time? For a given document, it seems
    that if I store the similar documents and corresponding scores, after time
    has passed and more documents have been added to the index, a fresh MLT
    query for the same original document will return a different score than the
    one I saved earlier for the same similar documents. If the text of neither
    document changes, shouldn't the similarity score be calculated exactly the
    same each time? Or is it possible that my first MLT query was run at a time
    when the documents were not "completely" indexed (if such a concept exists)
    and thus incomplete data was used in the query?

  2. (mostly to confirm my understanding) The score from an MLT result is not a
    symmetric property, right? Document A might be similar to B with a score of
    0.5, but document B might not even return A in its set of similar documents.
    Indeed, I have some examples of this in my index where an MLT query against
    document B returns no similar documents, even though B shows up in the MLT
    query against A. In this case, why does the index return no documents
    similar to B? There don't appear to be any query parameters [1] that would
    lower the minimum threshold to return results.

There is not a lot of documentation for MoreLikeThis, even at the Lucene
layer, and the best I've found is Aaron Johnson's write-up [2]. Any help
you can provide will be greatly appreciated. Thanks in advance.

Cheers,
Brandon

[1] http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query.html
[2] http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

Hi Brandon,

  1. Why would the score change over time?

MLT scores take into account all the documents in the index. Terms
that occur in both documents raise the MLT score more if few other
documents contain them. So words like "the" automatically carry the
least significance, since they occur across all documents, whereas
words like "skirr", "epicaricacy" and "schizothemia" would really
boost the score between 2 documents, since they would likely occur
in only a few documents. As more documents containing these words are
added, the influence of these terms drops, and so does the MLT score
between the 2 documents, even though those 2 documents have not
changed. As Einstein said, "it's all relative, man".

You can learn more here...
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/search/Similarity.html
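
To make the effect concrete, here is a rough sketch in Python. It is not
what MLT actually runs: it assumes the classic Lucene-style inverse document
frequency, idf = 1 + ln(numDocs / (docFreq + 1)), and uses a toy
"similarity" that just sums the squared idf of the terms two documents
share. The function names, term counts and index sizes are all made up for
illustration, and real scoring also involves term frequency, length norms
and other factors.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style IDF (assumed here): 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def shared_term_weight(shared_terms, doc_freqs, num_docs):
    """Toy similarity: sum of squared IDF over the terms both documents share.

    This is a deliberate simplification, not the real MLT/Lucene score.
    """
    return sum(idf(doc_freqs[term], num_docs) ** 2 for term in shared_terms)


# Terms that appear in both of our two (unchanged) documents.
shared = ["the", "skirr"]

# Snapshot 1: 1,000 documents indexed; "skirr" occurs only in our two documents.
early = shared_term_weight(shared, {"the": 1000, "skirr": 2}, 1000)

# Snapshot 2: 9,000 more documents indexed, 500 of which also mention "skirr".
later = shared_term_weight(shared, {"the": 10000, "skirr": 502}, 10000)

print(f"early={early:.2f} later={later:.2f}")
```

In this sketch the rare word contributes far more than "the", and its
contribution shrinks once other documents start using it, so the cached
score no longer matches a fresh one. Under the same formula the weight can
also go up if the index grows with documents that do not contain the term,
so a stored score can drift in either direction.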

Cheers,
Phil

On Sun, Sep 11, 2011 at 5:49 PM, Phil Whelan phil123@gmail.com wrote:

Hi Brandon,

  1. Why would the score change over time?

MLT scores take into account all the documents in the index. [...]

Hi Phil,

This is exactly the type of response I had hoped for and it explains things
perfectly.

You can learn more here...

Similarity (Lucene 3.0.3 API)

I had not seen this document. I'll read up presently. Thanks for taking
the time.

Cheers,
Brandon