Treatment of low/high/missing terms in Common Terms Query

loren · May 11, 2015, 5:58pm

From the Common Terms Query docs: "The common terms query divides the query terms into two groups: more important (ie low frequency terms) and less important (ie high frequency terms which would previously have been stopwords)."

But what does it do with terms that do not exist in the corpus at all?

For example, if I have a movie corpus and I query "The Matrix qwerty", I'd expect "the" to be treated as high frequency and "matrix" as low frequency, but what about "qwerty"?

nik9000 · May 11, 2015, 6:20pm

Terms that don't exist are considered low frequency:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-queries/4.5.0/org/apache/lucene/queries/CommonTermsQuery.java#188
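
To make the grouping concrete, here is a rough Python sketch of the idea. It is a simplified model and not Lucene's actual code: the function name split_terms, the doc_freqs mapping, and the sample numbers are all made up for illustration. I'm assuming a cutoff below 1 is read as a fraction of the documents in the index and a cutoff of 1 or more as an absolute document count.

```python
import math

def split_terms(terms, doc_freqs, max_doc, cutoff):
    """Split query terms into (low_freq, high_freq) groups.

    doc_freqs maps term -> number of documents containing it; terms
    absent from the mapping do not occur in the corpus at all.
    cutoff < 1 is treated as a fraction of max_doc, cutoff >= 1 as an
    absolute document count (an assumption of this sketch).
    """
    threshold = cutoff if cutoff >= 1 else math.ceil(cutoff * max_doc)
    low_freq, high_freq = [], []
    for term in terms:
        df = doc_freqs.get(term)
        if df is None:
            # Missing term: there is no frequency to compare, so it
            # joins the low-frequency (more important) group.
            low_freq.append(term)
        elif df > threshold:
            high_freq.append(term)
        else:
            low_freq.append(term)
    return low_freq, high_freq

# Hypothetical movie corpus of 1000 documents.
low, high = split_terms(
    ["the", "matrix", "qwerty"],
    {"the": 900, "matrix": 3},
    max_doc=1000,
    cutoff=0.1,
)
print(low, high)
```

With these made-up numbers, "qwerty" lands in the same group as "matrix", so it is handled like any other rare term rather than being dropped.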
