When rewriting multiterm query, add constant_score to every term, not to the whole query [SOLVED]

beowulfenator · September 22, 2015, 7:31pm

I am searching for cities from the GeoNames database. A typical search string would be "San Francisco CA". My documents have a city field and a state field. I run a match query against each field with the full search string, then combine the two matches using bool:

"query" : {
    "bool" : {
        "must" : {
            "match" : {
                "state" : {
                    "query" : "San Francisco CA"
                }
            }
        },
        "should" : {
            "match" : {
                "city" : {
                    "query" : "San Francisco CA"
                }
            }
        }
    }
}

I have these two documents in my db:

{"city" : "San Francisco", "state" : "CA"}
{"city" : "San Marino", "state" : "San Marino"}

The problem is that matching "san" to San Marino's state scores much higher than matching "CA" to San Francisco's state, because there are many cities with state "CA" and very few cities with state "San Marino".
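To make the effect concrete, here is a minimal sketch of the IDF formula used by Lucene's classic TF-IDF similarity (the default in ES 1.x). The document counts are invented purely for illustration, and the real score also involves term frequency, field norms, and per-shard statistics, which this ignores.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF as computed by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts, made up for illustration only.
NUM_DOCS = 100_000
idf_ca = classic_idf(doc_freq=5_000, num_docs=NUM_DOCS)  # "ca": common state value
idf_san = classic_idf(doc_freq=1, num_docs=NUM_DOCS)     # "san": rare in the state field

print(f"idf(ca)  = {idf_ca:.2f}")
print(f"idf(san) = {idf_san:.2f}")
```

The rarer the term is in a field, the larger its IDF, which is why a single match on "san" in the state field can outweigh a match on "ca".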

I tried to disable IDF using constant_score, but that leads to another problem: matching "San Francisco CA" to "San Francisco", where two terms match, gets the same score as matching "San Francisco CA" to "San Marino", where only one term matches. When a multiterm match query is rewritten into separate terms, is it possible to wrap each of the rewritten queries in constant_score, so that I get a score of 2 for matching "San Francisco" and a score of 1 for matching just "San"?

UPDATE (SOLUTION):
Working example of custom similarity class:

softwaredoug · September 23, 2015, 1:35am

Match queries analyze the query string and turn it into an OR boolean query of the resulting tokens, i.e. something like

[san] OR [francisco] OR [ca]

Boolean queries further bias results toward documents that match more of the criteria over those that match fewer. This is done through the coordinating factor. The coordinating factor is a score multiplier that punishes documents that don't meet all the criteria. So just by using the constant_score query instead of match (assuming that's what you mean) you'll end up losing the benefit of the coordinating factor.

So you WANT the coordinating factor WITHOUT the bias of IDF screwing everything up.
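As a rough illustration of what that looks like, the sketch below applies a coordination factor of matched/total clauses to clauses that each score a constant 1.0. This is a simplification that assumes classic Lucene scoring and leaves out query normalization and field norms; `coord_score` is a hypothetical helper, not an Elasticsearch API.

```python
def coord_score(clause_scores, total_clauses):
    """Sum of the matching clause scores, scaled by matched / total clauses."""
    matched = len(clause_scores)
    return sum(clause_scores) * matched / total_clauses


# Query terms on the city field: [san] OR [francisco] OR [ca], each worth 1.0.
san_francisco = coord_score([1.0, 1.0], total_clauses=3)  # matches "san", "francisco"
san_marino = coord_score([1.0], total_clauses=3)          # matches "san" only

print(san_francisco > san_marino)
```

With IDF out of the picture, the document matching two terms ranks above the one matching a single term.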

Option One -- parse the query string yourself, interweave constant_score & boolean

The simplest thing to do is to break up your query into should clauses that wrap constant_score queries. Something like this for just your city query (untested!)

"bool": {
    "should": [
        {
            "constant_score": {
                "query": {
                    "match": {
                        "city": "San"
                    }
                }
            }
        },
        {
            "constant_score": {
                "query": {
                    "match": {
                        "city": "Francisco"
                    }
                }
            }
        },
        {
            "constant_score": {
                "query": {
                    "match": {
                        "city": "CA"
                    }
                }
            }
        }
    ]
}
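If you build the request in client code, the "parse the query string yourself" step could look roughly like the sketch below. It assumes plain whitespace splitting is close enough to what your analyzer does, and it wraps the inner match in "query", which is the ES 1.x form of constant_score (later versions expect "filter" instead). The helper name `constant_score_bool` is made up for this example.

```python
import json


def constant_score_bool(field: str, query_string: str) -> dict:
    """Build a bool query with one constant_score should clause per term."""
    terms = query_string.split()  # assumption: whitespace tokenization
    return {
        "bool": {
            "should": [
                # ES 1.x syntax; newer versions use "filter" instead of "query".
                {"constant_score": {"query": {"match": {field: term}}}}
                for term in terms
            ]
        }
    }


body = {"query": constant_score_bool("city", "San Francisco CA")}
print(json.dumps(body, indent=2))
```

The same helper can be called once per field (city, state) and the results combined in an outer bool query.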

Option Two -- Disable IDF with Custom Similarity

You can force the search engine not to calculate IDF by using a custom similarity plugin. A Similarity controls how these statistics are computed at index/query time. In fact, this starter plugin example actually just disables IDF by returning 1.0f here. You could use that directly! Once you do that, your existing match queries ought to work.

Mark_Harwood · September 23, 2015, 6:59am

Look at the multi_match query for querying multiple fields, and its "cross_fields" type parameter, which does sensible things with IDF.
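For reference, a request body along those lines might be built like this. It is only a sketch: the field names come from the question, and `cross_fields_query` is a hypothetical helper, not part of any client library.

```python
import json


def cross_fields_query(query_string: str, fields: list) -> dict:
    """Build a multi_match request body that treats the fields as one combined field."""
    return {
        "query": {
            "multi_match": {
                "query": query_string,
                "type": "cross_fields",
                "fields": list(fields),
            }
        }
    }


body = cross_fields_query("San Francisco CA", ["city", "state"])
print(json.dumps(body, indent=2))
```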

softwaredoug · September 23, 2015, 12:56pm

Good call. Though that won't disable IDF, it will just blend them between the fields. Yet that might be good enough to solve this problem.

beowulfenator · September 23, 2015, 1:03pm

Thank you very much!

Implementing a custom similarity plugin solved my problem. It's worth noting that the example is outdated, so some changes to the code were necessary to get it to work with ES 1.7.

Mark_Harwood · September 23, 2015, 2:12pm

Though that won't disable IDF, it will just blend them between the fields.

It's not exactly a blend of IDF as you describe here [1].
It actually introduces a minor bias towards what it considers the "correct" field.
The blended DF used is the max across the fields, but it is incremented by 1 for the less likely fields to make them rank lower.

[1] Elasticsearch Cross Field Search Is A Lie - OpenSource Connections

colings86 · September 23, 2015, 2:28pm

@beowulfenator would you be willing to open a pull request on the Elasticsearch repository to fix the example for 1.7?

beowulfenator · September 23, 2015, 3:07pm

But that example is not part of ES repository, right?

softwaredoug · September 23, 2015, 3:13pm

Yeah good call. That'd probably work in this case and moreover, it picks a winner via dismax, so that should push the score to the appropriate field. Probably the first thing to try.

Whoever wrote that blog post is some kind of smarty pants

colings86 · September 23, 2015, 3:19pm

Ah yes, sorry. I thought it was in the main documentation, but it's not.
