As the official documentation states, max_query_terms is the maximum number of query terms that will be selected. I would like to know how this works.
For example, if I set max_query_terms = 6 and use the explain parameter, I can see that for some results only 2 or 3 words were used to calculate the BM25 score. I would really like to understand how this works: when does ES decide to use only 2 words rather than the maximum number I defined?
My understanding is that the query will look up the term frequencies for a particular field, then construct a priority queue that is sized to either max_query_terms or the number of terms in the field, whichever is smaller. The priority queue determines which terms are added to the final boolean query that MLT creates.
So if a field has only two distinct values across all the docs (on the shard), the queue will be sized to 2 rather than max_query_terms. Fields with higher cardinality will hit the limit instead, so the queue is capped at max_query_terms.
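
To make the sizing rule concrete, here is a rough Python sketch of the selection step as I understand it. It is not the actual Lucene/ES code: the function name select_terms, its inputs, and the tf-idf weighting are all assumptions made for illustration. The only part taken from the description above is that the queue holds min(max_query_terms, number of available terms) entries.

```python
import heapq
import math


def select_terms(term_freqs, doc_freqs, num_docs, max_query_terms):
    """Pick the highest-scoring terms, keeping at most max_query_terms of them.

    Hypothetical helper: term_freqs maps term -> frequency in the field,
    doc_freqs maps term -> number of documents containing the term.
    """
    # The queue can never hold more terms than are actually available.
    limit = min(max_query_terms, len(term_freqs))
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        # Illustrative tf-idf weight (an assumption, not the real formula).
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0
        scored.append((tf * idf, term))
    # Keep only the `limit` best-scoring terms, best first.
    return [term for _, term in heapq.nlargest(limit, scored)]
```

With only two distinct terms available and max_query_terms = 6, this returns two terms; with eight distinct terms it returns six, which matches the "whichever is smaller" behaviour described above.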