A deeper understanding of term vectors and the more like this query


(Barry Baronas) #1

I am trying to gain a deeper understanding of the more_like_this query. As I understand it, the query derives term vectors either from the string fields specified or from all string fields of the specified document(s), and then uses the selected terms as a query to find other documents in the index. I also understand that these terms are selected using tf/idf, based on values found not only in the particular field of the source document, but also across that same field in every document that contains it.

First, is this a correct understanding of how the more_like_this query works?

Next, if that is the case, does more_like_this essentially ignore any other terms in the field that did not make the max_query_terms cut?
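
To make the first two questions concrete, here is a toy sketch of how I picture the term selection. It is not the actual Lucene/Elasticsearch implementation: the function `select_terms`, the smoothed idf formula, and the tiny corpus are all assumptions made up for illustration. Only the parameter names (max_query_terms, min_term_freq, max_doc_freq) are borrowed from the query itself.

```python
import math
from collections import Counter


def select_terms(doc_tokens, corpus, max_query_terms=25,
                 min_term_freq=1, max_doc_freq=None):
    """Toy tf/idf term selection (hypothetical, not the real MLT code)."""
    n_docs = len(corpus)
    term_freqs = Counter(doc_tokens)
    scored = []
    for term, tf in term_freqs.items():
        # Drop terms that are too rare in the source document.
        if tf < min_term_freq:
            continue
        # Document frequency: how many documents contain the term in this field.
        df = sum(1 for tokens in corpus if term in tokens)
        # Drop terms that are too common across the index.
        if max_doc_freq is not None and df > max_doc_freq:
            continue
        # Assumed smoothed idf; the real scoring formula may differ.
        idf = math.log(1 + n_docs / (df + 1))
        scored.append((tf * idf, term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Everything past max_query_terms is simply not part of the query.
    return [term for _, term in scored[:max_query_terms]]


corpus = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "quick", "dog", "barks"],
    ["the", "fox", "hunts", "the", "rabbit"],
]
source_doc = corpus[3]

top_two = select_terms(source_doc, corpus, max_query_terms=2)
top_two_capped = select_terms(source_doc, corpus, max_query_terms=2,
                              max_doc_freq=3)
print(top_two)
print(top_two_capped)
```

In this sketch "the" wins a slot purely on term frequency unless max_doc_freq filters it out, and "fox" never makes the cut in either case, which is the behaviour I am asking about.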

Finally, assuming the answer to both questions is yes, what are some ways to maximize "quality" terms beyond providing a list of stopwords to exclude? Should I use max_doc_freq in a manner similar to cutoff_frequency in the common terms query?
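
For reference, this is roughly the kind of request body I have in mind for that last question, written as a Python dict. The helper `build_mlt_query`, the index name, document id, field names, and threshold values are placeholders I invented. The option names themselves (fields, like, min_term_freq, max_query_terms, min_doc_freq, max_doc_freq, stop_words) are the ones the more_like_this query accepts. Whether max_doc_freq is the right lever here is exactly what I am unsure about.

```python
import json


def build_mlt_query(index, doc_id, fields, max_query_terms=25,
                    min_term_freq=2, min_doc_freq=5,
                    max_doc_freq=None, stop_words=None):
    """Build a more_like_this request body (hypothetical helper)."""
    mlt = {
        "fields": fields,
        # "like" can reference an indexed document by index and id.
        "like": [{"_index": index, "_id": doc_id}],
        "min_term_freq": min_term_freq,
        "max_query_terms": max_query_terms,
        "min_doc_freq": min_doc_freq,
    }
    # Only send the optional filters when they are actually set.
    if max_doc_freq is not None:
        mlt["max_doc_freq"] = max_doc_freq
    if stop_words:
        mlt["stop_words"] = list(stop_words)
    return {"query": {"more_like_this": mlt}}


# Placeholder index, id, fields, and thresholds, for illustration only.
body = build_mlt_query(
    index="articles",
    doc_id="1",
    fields=["title", "body"],
    max_query_terms=12,
    max_doc_freq=1000,
    stop_words=["the", "and", "of"],
)
print(json.dumps(body, indent=2))
```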

I am very new to search and I am trying to gain some insight into how these query tools work.

Thank you.

