Getting similarity scores by issuing MLT-queries doesn't work for some documents

Hi,
I have a very basic ES setup. All items have just two fields: id and content.

I want to find the top 20 (or 100) most similar documents for each document in my index by getting their BM25 scores. My understanding is that this can be achieved by issuing MLT queries. However, for some documents I receive fewer than 20 results, and for some even zero. But shouldn't every document receive a score, however poor it is? Furthermore, I know that there are fairly similar documents in my dataset, so finding zero or just four that are deemed similar is definitely not the answer I was looking for.

To conclude: I want to have the top 20 BM25 Scores for all items in my index regarding the content field. Right now my query looks like this:

{
    'query': {
        'more_like_this': {
            'fields': ['content'],
            'like': {'_index': 'war_stories', '_id': 85},
            'min_term_freq': 1,
            'min_doc_freq': 1
        }
    },
    'from': 0,
    'size': 20
}

The index has roughly 22,000 documents in one local shard.

Thanks for any insight.

MLT queries have a number of default settings designed to trim the long tail of low quality matches.
These are term-selection settings and query-formation parameters. One example is minimum_should_match, which by default requires matching docs to contain at least 30% of the selected terms.

All of these settings can be tweaked.
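
As a rough illustration of that tweaking, here is a minimal sketch that rebuilds the query from the question with the relevant settings spelled out. The helper name build_mlt_query is made up for this example, and the values (5%, 25 terms) are only illustrative, not recommendations; the parameter names themselves are the ones the more_like_this query accepts.

```python
def build_mlt_query(index, doc_id, size=20, minimum_should_match="5%"):
    """Build a more_like_this search body for one source document.

    Hypothetical helper for illustration; the values are not tuned.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": ["content"],
                "like": {"_index": index, "_id": doc_id},
                # Term-selection settings: which terms of the source
                # document are eligible to become query terms.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": 25,
                # Query-formation setting: what share of the selected
                # terms a candidate document must contain.
                "minimum_should_match": minimum_should_match,
            }
        },
        "from": 0,
        "size": size,
    }


body = build_mlt_query("war_stories", 85)
```

The resulting dict can be passed as the search body to whatever client you already use. Lowering minimum_should_match is usually the setting that brings back the missing results, while the term-selection settings control which terms are considered in the first place.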

Thank you so much! I know it seems like I didn't read the documentation, but I assure you I did. I must have overlooked this.

Out of curiosity, do you have an idea why 30% is the default value? It seems that, especially for longer texts, it's not reasonable to assume that similar documents share that many words.
My query works when I set minimum_should_match to 5% and the highest match has a score of 48.

Again, thanks a lot!


It's a precision/recall balancing act which is always tough.
If you are too lax and go for recall, then yes, the top-matching docs will still be the most relevant (containing most of the rarer terms), but you'll have a long tail of garbage, which means that any aggregations might be summarising surprisingly irrelevant things.
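
To put numbers on that balance, here is a small sketch of how many of the selected terms a document has to contain at a given percentage. It assumes 25 selected terms (the default max_query_terms) and that the computed number is rounded down, which is how the minimum_should_match documentation describes percentages; required_matches is just a name picked for this example. Note that the percentage applies to the selected terms, not to every word in the source document.

```python
def required_matches(num_selected_terms, percent):
    """Number of selected terms a candidate document must contain.

    Assumption: the percentage result is rounded down.
    """
    return (num_selected_terms * percent) // 100


for pct in (30, 5):
    print(f"{pct}% of 25 selected terms -> {required_matches(25, pct)} must match")
```

Under these assumptions the default 30% asks for 7 of the 25 selected terms, while 5% asks for only 1, which is why the looser setting returns many more (and weaker) matches.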
