When rewriting multiterm query, add constant_score to every term, not to the whole query [SOLVED]

I am looking for cities from geonames db. A typical search string would be "San Francisco CA". I have documents that have a city and a state field. I do a match query, matching search string to city and state, then combine these matches using bool:

"query" : {
    "bool" : {
        "must" : {
            "match" : {
                "state" : {
                    "query" : "San Francisco CA"
                }
            }
        },
        "should" : {
            "match" : {
                "city" : {
                    "query" : "San Francisco CA"
                }
            }
        }
    }
}

I have these two documents in my db:

{"city" : "San Francisco", "state" : "CA"}
{"city" : "San Marino", "state" : "San Marino"}

The problem is that matching "san" to San Marino's state scores much higher than matching "CA" to San Francisco's state, because there are many cities with state "CA" and very few cities with state "San Marino".
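To make the IDF effect described above concrete, here is a minimal sketch using the classic Lucene-style formula idf = 1 + ln(numDocs / (docFreq + 1)). The collection size and document frequencies are made-up numbers chosen only for illustration, not figures from the geonames data.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene TF-IDF style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical collection statistics (assumptions, not real geonames counts).
NUM_DOCS = 100_000
DF_STATE_CA = 5_000   # many cities have state "CA"
DF_STATE_SAN = 10     # very few cities have "san" in the state field

idf_ca = classic_idf(DF_STATE_CA, NUM_DOCS)
idf_san = classic_idf(DF_STATE_SAN, NUM_DOCS)

# The rare term gets the larger weight, which is why "san" in the
# state field can outrank the much more common "ca".
print(f"idf(state:ca)  = {idf_ca:.2f}")
print(f"idf(state:san) = {idf_san:.2f}")
```

The full score also involves term frequency and field norms, which this sketch leaves out.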

I tried to disable IDF using constant_score, but that leads to another problem: matching "San Francisco CA" to "San Francisco", where two terms match, gets the same score as matching "San Francisco CA" to "San Marino", where only one term matches. When a multiterm match query is rewritten into separate terms, is it possible to wrap each of the rewritten queries in constant_score, so that I get a score of 2 for matching "San Francisco" and a score of 1 for matching just "San"?

UPDATE (SOLUTION):
Working example of custom similarity class:

Match queries analyze the query text and turn it into an OR boolean query of the resulting tokens, i.e. something like

[san] OR [francisco] OR [ca]

Boolean queries further favor results that match more of the criteria over those that match only a few. This is done through the coordinating factor, a score multiplier that punishes documents that don't meet all the criteria. So just by using the constant_score query instead of match (assuming that's what you mean) you'll end up losing the benefit of the coordinating factor.

So you WANT the coordinating factor WITHOUT the bias of IDF screwing everything up.
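As a rough illustration of that goal, the sketch below gives every matching term a constant score of 1.0 and then multiplies by a coord-style factor (matched clauses / total clauses). It is a toy model under those assumptions, not the engine's actual scoring code; the function name and the term sets are hypothetical.

```python
def toy_score(query_terms, doc_terms):
    """Sum a constant 1.0 per matching term, then apply a coord-style factor."""
    matched = [t for t in query_terms if t in doc_terms]
    # Coordinating factor: fraction of query clauses that matched.
    coord = len(matched) / len(query_terms)
    return len(matched) * 1.0 * coord


query = ["san", "francisco", "ca"]

# City field tokens for the two example documents.
san_francisco = toy_score(query, {"san", "francisco"})
san_marino = toy_score(query, {"san", "marino"})

# Two matching terms beat one, and IDF plays no part in the result.
print(san_francisco > san_marino)
```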

Option One -- parse the query string yourself, interweave constant_score & boolean

The simplest thing to do is to break up your query into should clauses, each wrapping a constant_score query. Something like this for just your city query (untested! On ES 1.x the inner query goes under "query"; newer versions expect "filter" instead):

      "bool": {
          "should": [
              {
                  "constant_score": {
                      "query": {
                          "match": { "city": "San" }
                      }
                  }
              },
              {
                  "constant_score": {
                      "query": {
                          "match": { "city": "Francisco" }
                      }
                  }
              },
              {
                  "constant_score": {
                      "query": {
                          "match": { "city": "CA" }
                      }
                  }
              }
          ]
      }
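
Since Option One means parsing the search string yourself, here is a minimal sketch of building that bool query programmatically. It assumes plain whitespace splitting is good enough (the real analyzer may tokenize differently) and that constant_score takes its inner query under a "query" key as on ES 1.x; the helper name is hypothetical.

```python
import json


def per_term_constant_score(field: str, text: str) -> dict:
    """Build a bool/should query with one constant_score clause per term."""
    terms = text.split()  # assumption: whitespace tokenization
    should = [
        # ES 1.x style wrapper; newer versions expect "filter" here.
        {"constant_score": {"query": {"match": {field: term}}}}
        for term in terms
    ]
    return {"bool": {"should": should}}


city_query = per_term_constant_score("city", "San Francisco CA")
print(json.dumps({"query": city_query}, indent=2))
```

The same helper could be called once per field (city, state) and the results combined in an outer bool query.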

Option Two -- Disable IDF with Custom Similarity

You can force the search engine not to calculate IDF by using a custom similarity plugin. A Similarity controls how scoring statistics such as IDF are computed at index and query time. In fact this starter plugin example just disables IDF by returning 1.0f here. You could use that directly! Once you've done that, your existing match queries ought to work.

Look at the multi_match query for querying multiple fields, and its "cross_fields" type, which does sensible things with IDF.

Good call. Though that won't disable IDF, it will just blend them between the fields. Yet that might be good enough to solve this problem.
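
For reference, the multi match suggestion could look roughly like the request body below, written as a Python dict so it can be dumped to JSON. The field names come from the question; the helper name is hypothetical, and this is an untested sketch rather than a verified fix.

```python
import json


def cross_fields_query(text, fields):
    """Build a multi_match request body using the cross_fields type."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": list(fields),
            }
        }
    }


body = cross_fields_query("San Francisco CA", ["city", "state"])
print(json.dumps(body, indent=2))
```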

Thank you very much!

Implementing a custom similarity plugin solved my problem. It's worth noting that the example is outdated, so some changes to the code were necessary to get it to work with ES 1.7.

Though that won't disable IDF, it will just blend them between the fields.

It's not exactly a blend of IDF as you describe here [1].
It actually introduces a minor bias towards what it considers the "correct" field.
The blended DF used is the max across the fields, but it is incremented by 1 for the less likely fields so that they rank lower.

[1] Elasticsearch Cross Field Search Is A Lie - OpenSource Connections
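
One way to read the blending rule described above is sketched below: take the max document frequency across fields, then bump it by 1 each time we step down to a field where the term is rarer. This is a simplified toy model of that description with made-up numbers; the actual implementation handles more details that are ignored here.

```python
def blended_doc_freqs(doc_freqs: dict) -> dict:
    """Toy model: max DF for the likeliest field, +1 per step down."""
    if not doc_freqs:
        return {}
    # Fields where the term is most frequent come first.
    ordered = sorted(doc_freqs.items(), key=lambda kv: kv[1], reverse=True)
    blended = {}
    current = ordered[0][1]   # start from the max DF
    previous = ordered[0][1]
    for field, df in ordered:
        if df < previous:
            current += 1      # less likely field: slightly higher DF, lower IDF
        blended[field] = current
        previous = df
    return blended


# Hypothetical DFs for the term "ca" in each field.
result = blended_doc_freqs({"state": 900, "city": 5})
print(result)
```

A higher DF means a lower IDF, so the field where the term is rarer ends up scoring slightly lower than the "correct" field.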

@beowulfenator would you be willing to open a pull request on the Elasticsearch repository to fix the example for 1.7?

But that example is not part of ES repository, right?

Yeah good call. That'd probably work in this case :smile: and moreover, it picks a winner via dismax, so that should push the score to the appropriate field. Probably the first thing to try.

Whoever wrote that blog post is some kind of smarty pants :stuck_out_tongue:

Ah yes, sorry. I thought it was in the main documentation but it's not :slight_smile: