Hi,
I have a very basic Elasticsearch setup. All documents have just two fields: id and content.
I want to find the top 20 (or 100) most similar documents for each document in my index by getting their BM25 score. My understanding is that this can be achieved by issuing MLT queries. However, for some documents I receive fewer than 20 results, and for some even zero. But shouldn't each and every document receive a score, regardless of how poor it is? Furthermore, I know that there are fairly similar documents in my dataset, so finding zero or just four that are deemed similar is definitely not the answer I was looking for.
To conclude: I want to have the top 20 BM25 Scores for all items in my index regarding the content field. Right now my query looks like this:
MLT queries have a number of default settings designed to trim the long tail of low quality matches.
These are term-selection settings and query-formulation parameters. One example is minimum_should_match, which by default requires matching docs to contain 30% of the selected terms.
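To make the defaults concrete, here is a minimal sketch of a search body that relaxes them. It only builds the JSON and does not talk to a cluster. The index name "my-index", the document id "1", and the helper build_mlt_query are hypothetical; the relaxed values (min_term_freq and min_doc_freq of 1, minimum_should_match of "5%") are illustrative assumptions, not recommendations.

```python
import json


def build_mlt_query(index, doc_id, field="content", size=20,
                    minimum_should_match="5%"):
    """Build a more_like_this search body for one source document.

    Hypothetical helper for illustration: it returns a plain dict that
    could be sent as the body of a _search request.
    """
    return {
        "size": size,  # how many similar documents to return
        "query": {
            "more_like_this": {
                "fields": [field],
                # Use an indexed document as the "like" input.
                "like": [{"_index": index, "_id": doc_id}],
                # Term-selection settings (assumed relaxed values).
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": 25,
                # Query-formulation setting: share of the selected
                # terms that a matching document must contain.
                "minimum_should_match": minimum_should_match,
            }
        },
    }


body = build_mlt_query("my-index", "1")
print(json.dumps(body, indent=2))
```

Lowering minimum_should_match widens the set of matches; the other settings change which terms are picked from the source document in the first place.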
Thank you so much! I know it seems like I didn't read the documentation, but I assure you I did. I must've overlooked this.
Out of curiosity, do you have any idea why 30% is the default value? It seems that, especially for larger strings, it's not reasonable to assume that similar documents share that many words.
My query works when I set minimum_should_match to 5% and the highest match has a score of 48.
It's a precision/recall balancing act which is always tough.
If you are too lax and go for recall, yes the top-matching docs will still be the most relevant (containing most of the rarer terms) but you'll have a long tail of garbage which means that any aggregations might be summarising surprisingly irrelevant things.
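A rough way to see the trade-off is to count how many of the selected terms a document must contain at a given percentage. The sketch below assumes 25 selected terms (the documented default for max_query_terms) and assumes the percentage is rounded down to a whole number of terms; required_matches is a hypothetical helper, not part of any Elasticsearch client.

```python
def required_matches(selected_terms, percent):
    """Number of selected terms a document must contain.

    Assumes the percentage of terms is rounded down to an integer.
    """
    return selected_terms * percent // 100


# Assumed: 25 terms were selected from the source document.
print(required_matches(25, 30))  # 7
print(required_matches(25, 5))   # 1
```

Under these assumptions, the 30% default demands seven shared terms, while 5% accepts any document sharing a single selected term, which is where the long tail comes from.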