I have two otherwise identical indices with overlapping data; one index simply holds more documents. The same query returns the same documents from both indices, but in a different order.
When I search for "foo" (boost 2) and "bar", I expect documents containing both "foo" and "bar" to rank before documents containing only "foo". In Index 1 this works correctly, and I get "some foo blah bar document" as the first result.
In Index 2 I get "some foo" as the first result, and the desired result "some foo blah bar" is far down the list with a lower score.
I want to understand, or rather influence, the score: in our scenario, a document that matches more of the search terms must always score higher.
Can anyone explain the "wrong" result below?
Can we determine/change the algorithm for scoring?
Query (same for both indices)
{
  "from": 0,
  "size": 1000,
  "_source": ["text_en"],
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "text_en": {
              "query": "foo",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "text_en": "bar"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
Result from Index 1 (correct)
Here we get our document "some foo 205 x bar" as the first result with the highest score:
"_score": 19.79837,
"_source": {
  "text_en": "some foo 205 x bar"
},
"_explanation": {
  "value": 19.79837,
  "description": "sum of:",
  "details": [
    {
      "value": 14.814142,
      "description": "weight(text_en:foo in 31366) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 14.814142,
          "description": "score(freq=1.0), product of:",
          "details": [
            {
              "value": 4.4,
              "description": "boost",
              "details": []
            },
            {
              "value": 8.928385,
              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details": [
                {
                  "value": 195,
                  "description": "n, number of documents containing term",
                  "details": []
                },
                {
                  "value": 1474669,
                  "description": "N, total number of documents with field",
                  "details": []
                }
              ]
            },
            {
              "value": 0.37709513,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "freq, occurrences of term within document",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 29.0,
                  "description": "dl, length of field",
                  "details": []
                },
                {
                  "value": 19.306864,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "value": 4.9842277,
      "description": "weight(text_en:bar in 31366) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 4.9842277,
          "description": "score(freq=1.0), product of:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 6.0079217,
              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details": [
                {
                  "value": 3626,
                  "description": "n, number of documents containing term",
                  "details": []
                },
                {
                  "value": 1474669,
                  "description": "N, total number of documents with field",
                  "details": []
                }
              ]
            },
            {
              "value": 0.37709513,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "freq, occurrences of term within document",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 29.0,
                  "description": "dl, length of field",
                  "details": []
                },
                {
                  "value": 19.306864,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    }
  ]
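As a sanity check, the numbers in the Index 1 explanation can be reproduced by hand. The Python sketch below only plugs the values from the explain output into the two formulas quoted there. The function names are made up for illustration, the boost values 4.4 and 2.2 are taken as-is from the output, and small rounding differences are to be expected because the engine reports 32-bit floats.

```python
import math

def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

N = 1474669  # total number of documents with the field (Index 1)

# Both terms occur once in the same field, so they share one tf factor.
tf = bm25_tf(freq=1.0, dl=29.0, avgdl=19.306864)

foo_score = 4.4 * bm25_idf(195, N) * tf   # "boost" value copied from the explain output
bar_score = 2.2 * bm25_idf(3626, N) * tf  # "boost" value copied from the explain output
total = foo_score + bar_score

# Should land close to the reported 14.814142, 4.9842277 and 19.79837.
print(foo_score, bar_score, total)
```

Each term's contribution is the product boost * idf * tf, and the document score is the sum over the matching terms.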
Result from Index 2
Here the wrong document "some foo" gets a higher score (21) than our desired document (19):
"_score": 21.883192,
"_source": {
  "text_en": "some foo"
},
"_explanation": {
  "value": 21.883192,
  "description": "sum of:",
  "details": [
The explanation is similar to above.
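Going by the formulas in the explain output, the only factor that depends on the document itself is tf, and it includes the field length dl. The sketch below shows how that factor reacts to field length. The dl of 2.0 and the function name are hypothetical, and avgdl is borrowed from Index 1 because the Index 2 statistics are not shown here, so this illustrates the formula rather than reproducing the Index 2 score.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

AVGDL = 19.306864  # avgdl from the Index 1 explanation

long_field = bm25_tf(freq=1.0, dl=29.0, avgdl=AVGDL)   # value from the Index 1 explanation
short_field = bm25_tf(freq=1.0, dl=2.0, avgdl=AVGDL)   # hypothetical short field

# With b = 0 the dl / avgdl term drops out of the formula entirely,
# so both field lengths give the same tf factor.
long_no_norm = bm25_tf(freq=1.0, dl=29.0, avgdl=AVGDL, b=0.0)
short_no_norm = bm25_tf(freq=1.0, dl=2.0, avgdl=AVGDL, b=0.0)

print(long_field, short_field, long_no_norm, short_no_norm)
```

A single occurrence in a short field gets a noticeably larger tf factor than the same occurrence in a long field, so a short document matching only one term can outweigh a longer document matching both.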