Getting similarity scores by issuing MLT-queries doesn't work for some documents

Moiddes · April 20, 2020, 1:59pm

Hi,
I have a very basic ES setup. All items have just two fields: id and content.

I want to find the top 20 (or 100) most similar documents for each document in my index by getting their BM25 score. My understanding is that this can be achieved by issuing MLT queries. However, for some documents I receive fewer than 20 results, and for some even zero. But shouldn't each and every document receive a score, regardless of how poor it is? Furthermore, I know that there are fairly similar documents in my dataset, so finding 0 or just 4 that are deemed similar is definitely not the answer I was looking for.

To conclude: I want to have the top 20 BM25 Scores for all items in my index regarding the content field. Right now my query looks like this:

{
'query':
    {'more_like_this':
        {'fields': ['content'],
        'like':
            {'_index': 'war_stories', '_id': 85},
        'min_term_freq': 1, 'min_doc_freq': 1}
    },
'from': 0, 'size': 20}

The index has roughly 22,000 documents in one local shard.

Thanks for any insight.

Mark_Harwood · April 20, 2020, 2:18pm

MLT queries have a number of default settings designed to trim the long tail of low-quality matches.
These are term-selection settings and query-formulation parameters. One example is minimum_should_match, which by default requires matching docs to contain 30% of the selected terms.

All of these settings can be tweaked.
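As a rough illustration of what tweaking these settings can look like, here is a minimal sketch that builds a search body with the term-selection and query-formulation parameters spelled out. The helper name build_mlt_query and the MLT_DEFAULTS dict are hypothetical; the default values are the ones I understand the more_like_this documentation to list, so check them against the Elasticsearch version in use. The override value "10%" is arbitrary.

```python
# Documented defaults of the more_like_this query as I understand them;
# verify against the docs for your Elasticsearch version.
MLT_DEFAULTS = {
    "max_query_terms": 25,          # at most this many terms are selected
    "min_term_freq": 2,             # term must occur this often in the input doc
    "min_doc_freq": 5,              # term must occur in at least this many docs
    "minimum_should_match": "30%",  # share of selected terms a hit must contain
}


def build_mlt_query(index, doc_id, fields=("content",), size=20, **overrides):
    """Hypothetical helper: return a search body for a more_like_this query.

    Any keyword in `overrides` replaces the corresponding default above.
    """
    params = {**MLT_DEFAULTS, **overrides}
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": {"_index": index, "_id": doc_id},
                **params,
            }
        },
        "from": 0,
        "size": size,
    }


# Relax term selection as in the original query and lower the match threshold.
body = build_mlt_query(
    "war_stories", 85,
    min_term_freq=1, min_doc_freq=1, minimum_should_match="10%",
)
```

The resulting dict can be passed as the request body of a search call. Making the defaults explicit mainly helps to see which setting is responsible when a document returns fewer hits than expected.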

Moiddes · April 20, 2020, 4:02pm

Thank you so much! I know it seems like I didn't read the documentation, but I assure you I did. I must've overlooked this.

Out of curiosity, do you have an idea why 30% is the default value? It seems that, especially for larger strings, it's not reasonable to assume that similar documents share that many words.
My query works when I set minimum_should_match to 5%, and the highest match has a score of 48.
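To put numbers on the two thresholds, here is a small worked sketch. It assumes that the percentage applies to the terms MLT selects from the input document (at most max_query_terms, documented default 25) rather than to every word in it, and that the computed number is rounded down, as the minimum_should_match documentation describes for percentages. The function name required_matches is hypothetical.

```python
def required_matches(num_selected_terms, minimum_should_match):
    """Hypothetical helper: how many selected terms a hit must contain.

    Only handles the plain positive percentage form such as "30%".
    The result is rounded down, per the minimum_should_match docs.
    """
    percent = int(minimum_should_match.rstrip("%"))
    return num_selected_terms * percent // 100


# Assuming the full default of 25 selected terms:
for spec in ("30%", "5%"):
    print(spec, required_matches(25, spec))
# prints:
# 30% 7
# 5% 1
```

Under these assumptions, a hit needs 7 of the 25 selected terms at 30% but only 1 at 5%. A document that yields fewer selected terms gives correspondingly smaller numbers.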

Again, thanks a lot!

Mark_Harwood · April 20, 2020, 4:06pm

It's a precision/recall balancing act, which is always tough.
If you are too lax and go for recall, yes, the top-matching docs will still be the most relevant (containing most of the rarer terms), but you'll have a long tail of garbage, which means that any aggregations might be summarising surprisingly irrelevant things.

