I have what seems to be a fairly common use case for MoreLikeThis: indexing
articles and running MLT queries to learn what other documents I have in the
index that are similar. I am a little confused by the results of some of
these queries and was hoping this list could explain what's happening under
the hood.

One relevant detail about my application is that I'm caching the results of
the MLT queries so that I can do an aggregate query later on that says "give
me documents that are like any of this set of documents". So I have a table
that maps documents to similar documents and also saves the score that came
out of the MLT query. I have two questions about the similarity data that
I've generated so far:

1. Why would the score change over time? Suppose I store a document's
similar documents and their scores. Later, after more documents have been
added to the index, a fresh MLT query for the same original document
returns a different score for the same similar documents than the one I
saved earlier. If the text of neither document changes, shouldn't the
similarity score be calculated exactly the same way each time? Or is it
possible that my first MLT query ran at a time when the documents were not
"completely" indexed (if such a concept exists), so that incomplete data
was used in the query?

2. (Mostly to confirm my understanding.) The score from an MLT result is not
a symmetric property, right? Document A might be similar to B with a score
of 0.5, but document B might not even return A in its set of similar
documents. Indeed, I have examples of this in my index where an MLT query
against B returns no similar documents at all, even though B shows up in the
MLT query against A. In this case, why does the index return no documents
similar to B? There don't appear to be any query parameters that would lower
the minimum threshold for returning results.
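
To make the caching setup described above concrete, here is a minimal sketch
in Python with SQLite. The table name mlt_cache, its columns, and the helper
functions are hypothetical names chosen for illustration and are not part of
Lucene or Solr. Taking the best cached score per candidate in the aggregate
query is also just an assumption about how the scores might be combined.

```python
import sqlite3


def make_cache(conn):
    # Hypothetical schema: one row per (source document, similar document).
    conn.execute(
        """CREATE TABLE IF NOT EXISTS mlt_cache (
               source_id  TEXT NOT NULL,
               similar_id TEXT NOT NULL,
               score      REAL NOT NULL,
               PRIMARY KEY (source_id, similar_id)
           )"""
    )


def save_results(conn, source_id, results):
    # results: iterable of (similar_id, score) pairs from one MLT query.
    conn.executemany(
        "INSERT OR REPLACE INTO mlt_cache VALUES (?, ?, ?)",
        [(source_id, sim_id, score) for sim_id, score in results],
    )


def similar_to_any(conn, source_ids):
    # "Give me documents that are like any of this set of documents."
    # Assumption: rank each candidate by its best cached score.
    source_ids = list(source_ids)
    if not source_ids:
        return []
    marks = ",".join("?" for _ in source_ids)
    rows = conn.execute(
        f"""SELECT similar_id, MAX(score) AS best
            FROM mlt_cache
            WHERE source_id IN ({marks})
              AND similar_id NOT IN ({marks})
            GROUP BY similar_id
            ORDER BY best DESC, similar_id""",
        source_ids + source_ids,
    )
    return rows.fetchall()


def one_way_pairs(conn):
    # Pairs where A lists B as similar but B does not list A (question 2).
    rows = conn.execute(
        """SELECT a.source_id, a.similar_id
           FROM mlt_cache a
           LEFT JOIN mlt_cache b
             ON b.source_id = a.similar_id
            AND b.similar_id = a.source_id
           WHERE b.source_id IS NULL
           ORDER BY a.source_id, a.similar_id"""
    )
    return rows.fetchall()
```

Calling one_way_pairs on a populated cache lists the asymmetric cases from
question 2, and re-running save_results for a document overwrites the older
scores, which is where the drift from question 1 becomes visible.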

There is not much documentation for MoreLikeThis, even at the Lucene layer;
the best I've found is Aaron Johnson's write-up. Any help you can provide
would be greatly appreciated. Thanks in advance.