Treatment of low/high/missing terms in Common Terms Query


(Loren Siebert) #1

From the Common Terms Query docs: "The common terms query divides the query terms into two groups: more important (ie low frequency terms) and less important (ie high frequency terms which would previously have been stopwords)."

But what does it do with terms that do not exist in the corpus at all?

For example, if I have a movie corpus and I query "The Matrix qwerty", I'd expect "the" to be treated as high frequency and "matrix" as low frequency, but what about "qwerty"?


(Nik Everett) #2

Terms that don't exist in the index are treated as low frequency, so in your example "qwerty" ends up in the same group as "matrix":
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-queries/4.5.0/org/apache/lucene/queries/CommonTermsQuery.java#188
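
To make the rule concrete, here is a rough Python sketch of the grouping as I read it from that source. It is not the Lucene API: split_terms, doc_freqs, max_doc and cutoff are hypothetical names, the document counts are made up, and I'm assuming the query has already been analyzed into lowercase terms.

```python
import math


def split_terms(query_terms, doc_freqs, max_doc, cutoff):
    """Split query terms into (low_freq, high_freq) lists.

    doc_freqs maps term -> number of documents containing it; a term that
    never occurs in the index is simply absent from the mapping.
    cutoff >= 1 is read as an absolute document count, otherwise as a
    fraction of max_doc (assumption based on reading the linked source).
    """
    low_freq, high_freq = [], []
    for term in query_terms:
        df = doc_freqs.get(term)
        if df is None:
            # Term not in the index at all: it goes to the low frequency group.
            low_freq.append(term)
            continue
        threshold = cutoff if cutoff >= 1 else math.ceil(cutoff * max_doc)
        if df > threshold:
            high_freq.append(term)
        else:
            low_freq.append(term)
    return low_freq, high_freq


# Made-up numbers for a 10,000 document movie index.
doc_freqs = {"the": 9500, "matrix": 12}
low, high = split_terms("the matrix qwerty".split(), doc_freqs,
                        max_doc=10000, cutoff=0.01)
print("low frequency:", low)
print("high frequency:", high)
```

With these made-up counts, "the" is the only term above the cutoff, while "matrix" and the unknown "qwerty" both land in the low frequency group.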

