I am searching for cities from the GeoNames database. A typical search string would be "San Francisco CA". My documents have a city field and a state field. I run a match query with the search string against both city and state, then combine these matches using bool:
The problem is that matching "san" to San Marino's state scores much higher than matching "CA" to San Francisco's state, because there are many cities with state "CA" and very few cities with state "San Marino".
I tried to disable IDF using constant_score, but that leads to another problem: matching "San Francisco CA" to "San Francisco", where two terms match, gets the same score as matching "San Francisco CA" to "San Marino", where only one term matches. When a multi-term match query is rewritten into separate terms, is it possible to constant_score each of the rewritten queries, so that I get a score of 2 for matching "San Francisco" and a score of 1 for matching just "San"?
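To make the IDF effect concrete, here is a small sketch using the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)). The document counts are made up for illustration and are not taken from the real index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Inverse document frequency as in Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts, for illustration only (not real GeoNames statistics).
NUM_DOCS = 100_000
idf_common_state = classic_idf(doc_freq=5_000, num_docs=NUM_DOCS)  # e.g. "ca"
idf_rare_state = classic_idf(doc_freq=10, num_docs=NUM_DOCS)       # e.g. "san"

# The rarer the term, the larger its IDF, so a match on it scores higher.
print(idf_common_state, idf_rare_state)
```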
UPDATE (SOLUTION):
Working example of custom similarity class:
Match queries analyze the query string and turn it into an OR boolean query of the resulting tokens, i.e. something like
[san] OR [francisco] OR [ca]
Boolean queries further favor results that match more of the criteria over those that match fewer. This is done through the coordinating factor, a score multiplier that punishes documents that don't meet all the criteria. So if you simply wrap the whole match in a constant_score query (assuming that's what you mean), you end up losing the benefit of the coordinating factor.
So you WANT the coordinating factor WITHOUT the bias of IDF screwing everything up.
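As a rough illustration of what that buys you, the sketch below models only the coordinating factor (matched clauses divided by total clauses, as in Lucene's classic similarity) with every matching clause contributing a constant 1.0. It ignores query normalization and everything else the real scorer does, and the function name is made up.

```python
def coord_constant_score(query_terms, doc_terms):
    """Sum of constant per-clause scores (1.0 each) times the coord factor."""
    doc_terms = set(doc_terms)
    matched = sum(1 for term in query_terms if term in doc_terms)
    if matched == 0:
        return 0.0
    coord = matched / len(query_terms)  # overlap / maxOverlap
    return matched * 1.0 * coord


query = ["san", "francisco", "ca"]
score_sf = coord_constant_score(query, ["san", "francisco"])
score_sm = coord_constant_score(query, ["san", "marino"])

print(round(score_sf, 2))  # 1.33: two of three clauses matched
print(round(score_sm, 2))  # 0.33: one of three clauses matched
```

The two-term match ends up well ahead of the one-term match, and no term's rarity enters into it.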
Option One -- parse the query string yourself, interweave constant_score & boolean
The simplest thing to do is to break up your query into should clauses that wrap constant_score queries. Something like this for just your city query (untested!)
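A minimal sketch of building such a request body in Python follows. It assumes the city field is indexed with an analyzer that lowercases and splits on whitespace, so splitting the search string the same way yields matching terms. The helper name build_city_query and the field name "city" are placeholders, and like the advice above it has not been run against a cluster.

```python
import json


def build_city_query(search_string, field="city"):
    """Wrap each token in its own constant_score clause inside bool/should."""
    # Assumption: lowercase + whitespace split mimics the field's analyzer.
    tokens = search_string.lower().split()
    should = [
        {"constant_score": {"filter": {"term": {field: token}}, "boost": 1.0}}
        for token in tokens
    ]
    return {"query": {"bool": {"should": should}}}


body = build_city_query("San Francisco CA")
print(json.dumps(body, indent=2))
```

Each clause contributes the same fixed score when it matches, so the bool query can still rank documents by how many clauses matched.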
You can force the search engine to not calculate IDF using a custom similarity plugin. A Similarity controls how these statistics are computed at index/query time. In fact this starter plugin example actually just disables IDF by returning 1.0f here. You could use that directly! Once you did that, your existing match queries ought to work.
Implementing a custom similarity plugin solved my problem. It's worth noting that the example is outdated, so some changes to the code were necessary to get it to work with ES 1.7.
Though that won't disable IDF, it will just blend them between the fields.
It's not exactly a blend of IDF as you describe here [1].
It actually introduces a minor bias towards what it considers the "correct" field.
The blended DF used is the max across the fields, but it is incremented by 1 for the less likely fields to make them rank lower.
Yeah good call. That'd probably work in this case and moreover, it picks a winner via dismax, so that should push the score to the appropriate field. Probably the first thing to try.
Whoever wrote that blog post is some kind of smarty pants