Multi-value boost not scored properly, need help


(John Smilanick) #1

I need some help getting my index to sort results correctly. I am
using Mongoid and the Tire gem, but the configuration should be
straightforward to follow.

tire.settings :analysis => {
  :analyzer => {
    :skill_analyzer => {
      'tokenizer' => 'whitespace', 'filter' => ['lowercase'],
      'type' => "custom"
    },
    :location_analyzer => {
      'tokenizer' => 'whitespace', 'filter' => ['lowercase'],
      'type' => "custom"
    }
  }
}

tire.mapping :_boost => {:name => '_boost', :null_value => 1.0} do
  indexes :id, :index => :not_analyzed
  indexes :skills, :analyzer => 'skill_analyzer', :boost => 10.0, :omit_norms => true
  indexes :location, :analyzer => 'location_analyzer', :boost => 4.0, :omit_norms => true
  indexes :country_code, :index => :not_analyzed
  indexes :hireable, :index => :not_analyzed, :type => 'integer'
end


Sample record:

{"_boost":3.8305967274524235,"skills":[{"_value":"Bourne Shell","_boost":8.318875430559885},{"_value":"Scala","_boost":16.01893049877409},{"_value":"shell linux unix debian freebsd openbsd netbsd bsd gnu suse opensuse ubuntu red hat fedora gentoo slackware","_boost":8.318875430559885}],"location":[],"country_code":null,"hireable":0}

The offending field is 'skills'. When I search for a skill (say
'Scala'), the scores of the results don't reflect the boost I have
given Scala. You can see from the explain below that the first result
has a lower Scala boost than the second (16.02 versus 17.29), but it
still appears at the top. Can anyone help me configure this index correctly?

curl -XGET "http://localhost:9200/profiles/profile/_search?pretty=true" -d '{"explain":true,"query":{"bool":{"must":[{"query_string":{"query":"scala","default_operator":"OR","minimum_should_match":"60%"}}]}},"size":20,"from":0}'

{
  "took" : 179,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 744,
    "max_score" : 8.4369335,
    "hits" : [ {
      "_shard" : 3,
      "_node" : "mD28KAI2RV-XSHTNPrP2PQ",
      "_index" : "profiles",
      "_type" : "profile",
      "_id" : "4f34412032dbdae92700b0d7",
      "_score" : 8.4369335, "_source" : {"_boost":3.8305967274524235,"skills":[{"_value":"Bourne Shell","_boost":8.318875430559885},{"_value":"Scala","_boost":16.01893049877409},{"_value":"shell linux unix debian freebsd openbsd netbsd bsd gnu suse opensuse ubuntu red hat fedora gentoo slackware","_boost":8.318875430559885}],"location":[],"country_code":null,"hireable":0},
      "_explanation" : {
        "value" : 8.436933,
        "description" : "fieldWeight(_all:scala in 0), product of:",
        "details" : [ {
          "value" : 11.327094,
          "description" : "btq, product of:",
          "details" : [ {
            "value" : 0.70710677,
            "description" : "tf(phraseFreq=0.5)"
          }, {
            "value" : 16.01893,
            "description" : "allPayload(...)"
          } ]
        }, {
          "value" : 0.9931271,
          "description" : "idf(_all: scala=145)"
        }, {
          "value" : 0.75,
          "description" : "fieldNorm(field=_all, doc=0)"
        } ]
      }
    }, {
      "_shard" : 3,
      "_node" : "mD28KAI2RV-XSHTNPrP2PQ",
      "_index" : "profiles",
      "_type" : "profile",
      "_id" : "4f34a3ef248f4b066e002e46",
      "_score" : 7.5881076, "_source" : {"_boost":3.6149999999999998,"skills":[{"_value":"Bourne Again Shell","_boost":1.0396052545923595},{"_value":"Bourne Shell","_boost":10.453783509740283},{"_value":"Scala","_boost":17.288752227294154},{"_value":"CSS","_boost":5.840023623437591},{"_value":"XML","_boost":7.667998680185211},{"_value":"HTML","_boost":2.6905258048255556},{"_value":"web","_boost":4.265274714131573},{"_value":"frontend front-end","_boost":4.265274714131573},{"_value":"shell linux unix debian freebsd openbsd netbsd bsd gnu suse opensuse ubuntu red hat fedora gentoo slackware","_boost":5.7466943821663214}],"location":[],"country_code":null,"hireable":0},
      "_explanation" : {
        "value" : 7.588108,
        "description" : "fieldWeight(_all:scala in 4), product of:",
        "details" : [ {
          "value" : 12.224994,
          "description" : "btq, product of:",
          "details" : [ {
            "value" : 0.70710677,
            "description" : "tf(phraseFreq=0.5)"
          }, {
            "value" : 17.288752,
            "description" : "allPayload(...)"
          } ]
        }, {
          "value" : 0.9931271,
          "description" : "idf(_all: scala=145)"
        }, {
          "value" : 0.625,
          "description" : "fieldNorm(field=_all, doc=4)"
        } ]
      }
    }
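
To make the explain output above easier to read, here is a small Ruby sketch that multiplies the factors back together for both hits. All numbers are copied from the two "_explanation" trees; the Hit struct and the two helper methods are names made up for this illustration, not part of Tire or Elasticsearch. It only redoes the arithmetic and makes no claim about how Lucene computes the factors internally.

```ruby
# Factors copied from the two "_explanation" trees in the explain output.
Hit = Struct.new(:id, :tf, :payload, :idf, :field_norm, :reported)

hits = [
  Hit.new("4f34412032dbdae92700b0d7", 0.70710677, 16.01893,  0.9931271, 0.75,  8.436933),
  Hit.new("4f34a3ef248f4b066e002e46", 0.70710677, 17.288752, 0.9931271, 0.625, 7.588108)
]

# fieldWeight as reported by explain: btq (tf * allPayload) * idf * fieldNorm.
def field_weight(hit)
  hit.tf * hit.payload * hit.idf * hit.field_norm
end

# The same product with the fieldNorm factor left out (hypothetical, for comparison).
def weight_without_norm(hit)
  hit.tf * hit.payload * hit.idf
end

hits.each do |hit|
  puts format("%s recomputed=%.6f reported=%.6f without_norm=%.6f",
              hit.id, field_weight(hit), hit.reported, weight_without_norm(hit))
end
```

With fieldNorm included the first hit wins; with it left out the second hit (the one with the higher Scala payload) would come first. So in this explain output fieldNorm is the only factor that reverses the order.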


(John Smilanick) #2

I have reduced the problem to this example.

tire.mapping do
  indexes :id, :index => :not_analyzed
  indexes :skills, :type => 'string', :omit_term_freq_and_positions => true, :omit_norms => true
end

It sorts correctly when all the records have a single skill, e.g.

{"skills":[{"_value":"Scala","_boost":26.03926622472801}]} ==> score: 18.299232

The more 'skills' a record has, the lower its fieldNorm value and hence
the lower its score, e.g.

{"skills":[{"_value":"Scala","_boost":26.03926622472801},{"_value":"erlang","_boost":10}]} ==> score: 11.437021

It all comes down to the fieldNorm value, which starts at 1 for one skill
and decreases as skills are added. I thought omit_norms would do the
trick, but it doesn't seem to work for me. Any ideas?
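
For anyone following along, here is a rough Ruby sketch of why the norm drops as skills are added. It rests on two assumptions that are not stated anywhere in this thread: that the length norm is Lucene's classic 1 / sqrt(number of terms), and that the stored norm is rounded down to a coarse one-byte value. The method names are invented for this example, and the rounding is a simplified stand-in for Lucene's real byte encoding, not a copy of it.

```ruby
# Assumption: classic Lucene length norm, 1 / sqrt(number of terms in the field).
def length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

# Assumption: the norm is stored with very little precision. This simplified
# model keeps the power of two and rounds the fraction down to a multiple of 1/8.
def quantize_norm(value)
  fraction, exponent = Math.frexp(value) # value == fraction * 2**exponent, 0.5 <= fraction < 1
  Math.ldexp((fraction * 8).floor / 8.0, exponent)
end

(1..4).each do |terms|
  raw = length_norm(terms)
  puts format("%d term(s): raw=%.4f stored=%.4f", terms, raw, quantize_norm(raw))
end

# Score reported in this post for the single-skill record.
single_skill_score = 18.299232
puts format("predicted two-skill score: %.6f",
            single_skill_score * quantize_norm(length_norm(2)))
```

Under these assumptions one term gives a norm of 1.0, two terms give 0.625 and three give 0.5. Multiplying the single-skill score by 0.625 lands on the 11.437021 reported for the two-skill record. Note also that the explain output in the first post reports fieldNorm(field=_all), so the norm being applied may belong to the _all field rather than to skills.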


(Shay Banon) #3

Have you tried to simply boost at query time? I recommend using query time
boosting (either using boost on the query itself, or using the
custom_boost_factor wrapper).


(John Smilanick) #4

Neither boosting the query nor custom_boost_factor solves the problem:

curl -XGET "http://localhost:9200/profiles/profile/_search?pretty=true" -d '{"explain":true,"query":{"custom_boost_factor":{"boost_factor":5,"query":{"query_string":{"query":"scala"}}}},"size":20,"from":0}'

A person with a low Scala boost and one skill still scores higher than a person with a high Scala boost and more skills:
{"skills":[{"_value":"Scala","_boost":15.018930498774088}]} => score: 52.734985
{"skills":[{"_value":"Scala","_boost":26.03926622472801},{"_value":"erlang","_boost":10},{"_value":"shell","_boost":10}]} => score: 45.74808

We must be able to turn off the fieldNorm for our search to score results correctly. Is omit_norms supposed to disable the fieldNorm, which would make this a bug? Is there a way to write a custom score script that looks at the skill boosts of only the relevant skills? Or a custom score that negates the fieldNorm?
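
To pin down the ordering being asked for, here is a plain Ruby sketch that ranks the two records above by the boost of the matching skill only and ignores how many other skills a record has. This is not an Elasticsearch script and uses no Tire API. The helper matched_skill_boost is a hypothetical name, and the whitespace-and-lowercase matching is an assumption loosely modelled on the skill_analyzer from the first post.

```ruby
# The two records quoted above, as Ruby hashes.
profiles = [
  { "skills" => [{ "_value" => "Scala", "_boost" => 15.018930498774088 }] },
  { "skills" => [{ "_value" => "Scala",  "_boost" => 26.03926622472801 },
                 { "_value" => "erlang", "_boost" => 10 },
                 { "_value" => "shell",  "_boost" => 10 }] }
]

# Hypothetical helper: the highest boost among skills containing the term,
# matched on lowercased, whitespace-split tokens. Returns 0.0 when nothing matches.
def matched_skill_boost(profile, term)
  needle = term.downcase
  boosts = profile["skills"]
           .select { |skill| skill["_value"].downcase.split.include?(needle) }
           .map { |skill| skill["_boost"].to_f }
  boosts.max || 0.0
end

# Desired order: highest matching boost first, regardless of the number of skills.
ranked = profiles.sort_by { |profile| -matched_skill_boost(profile, "scala") }

ranked.each do |profile|
  puts format("scala boost=%.4f skills=%d",
              matched_skill_boost(profile, "scala"), profile["skills"].size)
end
```

With this ordering the three-skill record (Scala boost 26.04) comes first, which is the opposite of the scores reported above.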


(Shay Banon) #5

Maybe you can post a curl recreation and I can have a look?

