How exactly does max_query_terms work in an MLT query

As the official documentation states, max_query_terms is the MAX number of query terms that will be selected. I would like to know how this works.

For example, if I set max_query_terms = 6 and use the explain parameter, I can see that for some results only 2 or 3 terms were used to calculate the BM25 score. I really want to understand how this works: when does ES decide to use only 2 terms rather than the maximum I defined?

If you'd like to poke around the code, that limit is used here: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/common/lucene/search/XMoreLikeThis.java#L728

My understanding is that the query will look up the term frequencies for a particular field, then construct a priority queue that is sized to either max_query_terms or the number of terms in the field, whichever is smaller. The priority queue determines which terms are added to the final boolean query that MLT creates.

So if a field only has two different values across all the docs (on the shard), the queue will be sized to 2 rather than max_query_terms. Fields with higher cardinality will hit the limit instead, so the priority queue is capped at max_query_terms.
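To make the sizing rule concrete, here is a rough Python sketch of the idea described above. It is not the actual XMoreLikeThis code: the function name `select_query_terms`, its arguments, and the TF-IDF style score are all assumptions made for illustration. The only part taken from the explanation is that the queue holds at most min(max_query_terms, number of terms in the field) entries.

```python
import heapq
import math


def select_query_terms(term_freqs, doc_freqs, num_docs, max_query_terms):
    """Pick the highest-scoring terms, capped at max_query_terms (hypothetical helper)."""
    # The queue can never hold more terms than the field actually has.
    limit = min(max_query_terms, len(term_freqs))

    def score(term):
        # Assumed TF-IDF style weight; the real scoring may differ.
        idf = math.log((num_docs + 1) / (doc_freqs.get(term, 0) + 1)) + 1.0
        return term_freqs[term] * idf

    # heapq.nlargest stands in for the bounded priority queue.
    return heapq.nlargest(limit, term_freqs, key=score)


# A field with only two distinct terms: the cap of 6 is never reached.
few = select_query_terms({"red": 3, "blue": 1}, {"red": 10, "blue": 2}, 100, 6)

# A field with twenty distinct terms: the cap of 6 applies.
many_tf = {f"term{i}": i + 1 for i in range(20)}
many_df = {term: 5 for term in many_tf}
many = select_query_terms(many_tf, many_df, 100, 6)
```

In this toy setup `few` ends up with 2 terms and `many` with 6, which mirrors the behaviour seen in the explain output: the number of terms used can be lower than max_query_terms when fewer candidate terms are available.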

*Caveat: not an expert at MLT, so grain of salt :slight_smile:
