I was actually excited to dig into an experiment at work using the common terms query. I see, however, that it's been deprecated because its performance benefits are now available in the match query, as mentioned in this issue.
However, this query provides relevance functionality I don't think I can get in a match query (please correct me if I'm wrong). And the reason for deprecation seems to relate to max score / block WAND.
I've never understood common_terms to be about speed, but more about relevance. I was surprised this didn't come up in the deprecation discussions. Specifically, what I like about it is being able to set a cutoff frequency and then make low or high value terms mandatory (AND) or not mandatory (OR) based on document frequency.
Am I missing where I can do this with other functionality now?
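For reference, a minimal sketch of the kind of request being discussed, built as a Python dict. The field name `title` and the cutoff value are placeholders chosen for illustration, not values from a real index; the parameter names are those of the (deprecated) common terms query.

```python
import json


def common_terms_query(text, field="title", cutoff_frequency=0.001,
                       low_freq_operator="and", high_freq_operator="or"):
    """Build a request body for the deprecated common terms query.

    `field` and `cutoff_frequency` are illustrative placeholders.
    """
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    # Terms whose document frequency is above this cutoff
                    # are treated as high-frequency (common) terms.
                    "cutoff_frequency": cutoff_frequency,
                    # Low-frequency terms are all required ...
                    "low_freq_operator": low_freq_operator,
                    # ... while high-frequency terms stay optional.
                    "high_freq_operator": high_freq_operator,
                }
            }
        }
    }


body = common_terms_query("blue suede jacket")
print(json.dumps(body, indent=2))
```

With these settings, which terms end up mandatory depends entirely on the index's document frequencies at query time, which is the behavior in question here.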
Thanks @joshdevins -> yeah, optimizing BM25 and minimum_should_match params has been a big focus of our work. However, with minimum_should_match, we can't choose which tokens to drop. It just provides a hard floor on the number of tokens that must match. For example, in a product search, if someone searches for "blue suede jacket" I'd prefer if "blue" was made optional (presumably high DF) but "suede" and "jacket" were mandatory. So for the next round of experimentation, I had hoped to turn to common_terms as opposed to maintaining or computing a specific list of these low value terms outside the search engine.
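To make the "outside the search engine" alternative concrete, here is a rough sketch of splitting terms by document frequency and building a bool query from the result. Everything in it is an assumption for illustration: the document frequencies are made-up numbers, `title` is a placeholder field, and the 0.1 cutoff is arbitrary. It also assumes the search service can get document frequencies from somewhere, which is exactly the hard part.

```python
def split_by_doc_freq(terms, doc_freq, total_docs, cutoff=0.1):
    """Split terms into (rare, common) by the fraction of docs containing them."""
    rare, common = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        if ratio > cutoff:
            common.append(term)
        else:
            rare.append(term)
    return rare, common


def bool_query(terms, doc_freq, total_docs, field="title", cutoff=0.1):
    """Rare terms become mandatory (must); common terms stay optional (should)."""
    rare, common = split_by_doc_freq(terms, doc_freq, total_docs, cutoff)
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: t}} for t in rare],
                "should": [{"match": {field: t}} for t in common],
            }
        }
    }


# Hypothetical index statistics, invented for this example.
doc_freq = {"blue": 40_000, "suede": 900, "jacket": 5_000}
total_docs = 100_000

terms = "blue suede jacket".split()
rare, common = split_by_doc_freq(terms, doc_freq, total_docs)
print(rare, common)
print(bool_query(terms, doc_freq, total_docs))
```

With these invented numbers, "blue" lands in the optional clause while "suede" and "jacket" are required. Note that if every term is above the cutoff, this sketch produces a bool query with only `should` clauses, so a real version would need to decide how to handle that case.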
I always looked at this from a different angle. The question is not whether a term should be mandatory or not but rather how much it contributes to the overall score. The dance with minimum_should_match, common_terms and all these advanced options is there to minimize the impact of always searching all terms. So when WAND was introduced, I thought that it could be a chance to simplify things further. Users can opt for the default behavior of considering all terms optional and at the same time rely on internal optimizations to ensure that we'll not consider all documents eagerly.
I always found minimum_should_match and common_terms difficult to approach. Finding the right configuration is tricky and whatever you find must be updated as the data and queries evolve. So that's a lot of burden for users that "just" want search to surface the most relevant results automatically.
The specific case I'm working on: if you search for 'blue suede jacket' and there are only 2 exact matches, you'll see those 'blue AND suede AND jacket' matches first. If the results below those two are spurious matches on just blue or suede, then you'll see some irrelevant results even above the fold.
I was hoping to use common_terms to be a little smarter about making some terms mandatory (like jacket) and others optional (blue, suede, perhaps), so some of these lower down results would drop off and not be shown at all.
Of course, if there are other ideas on how to tamp down the recall a bit, I'd be open to them... I know I could reissue the query, relaxing it strategically, or do a bit more before the query hits Elasticsearch. But in a search service I don't have access to doc freq and other index stats that would be useful in making this decision.
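A minimal sketch of the "reissue and relax" option, assuming a hypothetical `search` callable that takes a request body and returns a list of hits. It starts by requiring every term via `minimum_should_match` on a match query and lowers the requirement only when too few results come back. Counting terms with `text.split()` is a simplification that ignores the field's analyzer, and `fake_search` is a stand-in with invented hit counts so the example runs without a cluster.

```python
def relaxed_search(text, search, field="title", min_hits=3):
    """Try the strictest query first; relax one term at a time if needed."""
    n_terms = len(text.split())  # approximation: ignores the real analyzer
    required, hits = 0, []
    for required in range(n_terms, 0, -1):
        body = {
            "query": {
                "match": {
                    field: {"query": text, "minimum_should_match": required}
                }
            }
        }
        hits = search(body)
        if len(hits) >= min_hits:
            break
    return required, hits


def fake_search(body):
    """Stand-in for a real client call; hit counts are invented."""
    (params,) = body["query"]["match"].values()
    counts = {3: 2, 2: 5, 1: 50}
    return ["doc"] * counts[params["minimum_should_match"]]


required, hits = relaxed_search("blue suede jacket", fake_search)
print(required, len(hits))
```

This costs extra round trips and still cannot choose which term to drop, so it only addresses the recall side of the problem.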
Hope that context makes sense. Anyway, it's up to you guys whether you want to deprecate it; I just wanted to share one use case where common_terms seems to help with relevance!