I'm not sure whether I'm trying to wrangle Elasticsearch into doing
something it wasn't designed for, but here we go.
We have an index (example_index) of documents (example) of the
following prototype (I've excluded unimportant fields):
{
"tag": ["apple", "fruit", "red", ..],
"popularity": 96,
"average_rating": 3.4
..
}
We're currently porting the search functionality from a set of SQL database
queries. In a very simplified form, a search for the query string
"apple fruit" results in a query like the following (there's actually also a
down-scoring of duplicate terms, but I've left that out to keep the example
simple):
{
"query": {
"custom_score": {
"query": {
"bool": {
"should": [{
"constant_score": {
"query": {
"term": {
"tag": "apple fruit"
}
},
"boost": "16"
}
}, {
"constant_score": {
"query": {
"term": {
"tag": "apple"
}
},
"boost": "8"
}
}, {
"constant_score": {
"query": {
"term": {
"tag": "fruit"
}
},
"boost": "8"
}
}]
}
},
"script": "_score * 4.0 + doc['popularity'].value * 5.0 + doc['average_rating'].value * 2.0"
}
}
}
As you can probably tell, the boost factor is being used to score different
levels of matches and to make the match scoring fit with the ranges of the
additional scoring variables (the popularity and average_rating fields).
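To make the intent concrete, here is a minimal Python sketch of the score I'd like a document to end up with if the boosts were applied as-is, with no normalization. The function name intended_score and the BOOSTS table are hypothetical and only mirror the simplified example query above; the weights 4.0, 5.0 and 2.0 come from the script, and the sample document uses the values from the prototype.

```python
# Hypothetical illustration only: this is plain arithmetic, not an
# Elasticsearch API. Boosts mirror the simplified example query.
BOOSTS = {"apple fruit": 16.0, "apple": 8.0, "fruit": 8.0}


def intended_score(doc, boosts=BOOSTS):
    """Score a document as if each matching clause contributed its raw boost."""
    # A term clause matches only when the exact term is one of the tags.
    match_score = sum(b for term, b in boosts.items() if term in doc["tag"])
    # Same weighting as the custom_score script.
    return match_score * 4.0 + doc["popularity"] * 5.0 + doc["average_rating"] * 2.0


# Values taken from the document prototype.
doc = {"tag": ["apple", "fruit", "red"], "popularity": 96, "average_rating": 3.4}
print(intended_score(doc))
```

Here "apple" and "fruit" each contribute 8, while the exact tag "apple fruit" is absent and contributes nothing, so the match part is 16 before the 4.0 weighting.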
However, this does not work as expected: the boost factor for each tag
term is effectively negated by the query normalization factor, as can be
seen from the following explained hit:
{
"_explanation": {
"description": "custom score, product of:",
"details": [{
"description": "script score function: product of:",
"details": [{
"description": "sum of:",
"details": [{
"description": "ConstantScore(tag:apple)^95.0, product of:",
"details": [{
"description": "boost",
"value": 95.0
}, {
"description": "queryNorm",
"value": 0.007443229
}],
"value": 0.70710677
}, {
"description": "ConstantScore(tag:fruit)^95.0, product of:",
"details": [{
"description": "boost",
"value": 95.0
}, {
"description": "queryNorm",
"value": 0.007443229
}],
"value": 0.70710677
}],
"value": 1.4142135
}],
"value": 3406.5686
}, {
"description": "queryBoost",
"value": 1.0
}],
"value": 3406.5686
},
"_id": "45550",
"_index": "example_index",
"_node": "_TRFral5Q7SUf0myvkL73g",
"_score": 3406.5686,
"_shard": 1,
"_source": {
"average_rating": 5.0,
"popularity": 65,
"tag": ["fruit", "food", "apple"],
..
},
"_type": "example"
}
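The numbers in this explanation are consistent with queryNorm being 1 / sqrt(sum of squared clause boosts). That formula is my assumption about how the default similarity behaves, not something I've confirmed in the documentation, but it reproduces the values above. A minimal Python sketch of the arithmetic (query_norm is a hypothetical helper, not a Lucene or Elasticsearch call):

```python
import math


def query_norm(boosts):
    """Assumed normalization: 1 / sqrt(sum of squared clause boosts)."""
    return 1.0 / math.sqrt(sum(b * b for b in boosts))


def normalized_scores(boosts):
    """Per-clause score after normalization: boost * queryNorm."""
    norm = query_norm(boosts)
    return [b * norm for b in boosts]


# The two matching clauses from the explanation, each with boost 95.0.
print(query_norm([95.0, 95.0]))              # roughly 0.0074432
print(sum(normalized_scores([95.0, 95.0])))  # roughly 1.4142

# Scaling every boost by the same factor changes nothing after normalization.
print(sum(normalized_scores([8.0, 8.0])))    # roughly 1.4142 as well
```

Under this assumption, only the ratio between boosts survives, so the absolute scale I was trying to set with the boost values is lost before the script ever sees _score.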
Naturally, I'd expect there to be a way to avoid the query normalization
(and quite possibly also a cleverer way of doing what I'm attempting
here), but my digging through the Lucene and Elasticsearch documentation
hasn't turned up any obvious solutions.
Can anyone provide a pointer as to what I should be doing either in terms
of changing my approach or fixing the scoring problem?
Best regards,
Nick Bruun