This post relates to a problem I encountered in my production database, which I describe here with a minimal reproducible example.
I have an index with the following mapping:
{
  "properties": {
    "test_field": {
      "type": "keyword"
    }
  }
}
I have two documents in the index:
{"test_field": ["good", "good", "good", "good"]}
{"test_field": ["good", "good"]}
I am trying to perform a search on terms in this keyword field. When I search for good, I want the first document to have a higher score because it contains more matching terms.
As far as I understand, norms are disabled by default for keyword fields, so they shouldn't affect the scoring. When I perform a multi_match query on this field (the query in my production database covers more fields), both of these documents receive the same score. As the following excerpt from the explain output shows, freq is computed as 1.
{
  "value": 0.45454544,
  "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
  "details": [{
      "value": 1,
      "description": "freq, occurrences of term within document",
      "details": []
    },
    ...
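As a sanity check on the excerpt above, the tf value can be reproduced by hand. This is only a minimal sketch: it assumes the default BM25 parameters k1 = 1.2 and b = 0.75 and a length ratio dl / avgdl of 1, none of which appear in the excerpt, and the function name bm25_tf is hypothetical.

```python
def bm25_tf(freq, k1=1.2, b=0.75, dl_over_avgdl=1.0):
    """tf term as described in the explain output:
    freq / (freq + k1 * (1 - b + b * dl / avgdl)).
    k1, b and dl_over_avgdl are assumed values, not taken from the excerpt.
    """
    return freq / (freq + k1 * (1 - b + b * dl_over_avgdl))


tf_reported = bm25_tf(freq=1)  # freq as reported by explain
tf_expected = bm25_tf(freq=4)  # what four counted occurrences would give

print(tf_reported, tf_expected)
```

Under those assumptions, freq = 1 gives roughly 0.4545, which matches the value in the excerpt, while freq = 4 would give roughly 0.769. That gap is the kind of score difference I expected between the two documents.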
After investigation, it appears that the query thinks there is only one element in the field, as shown by this test query:
{
  "query": {
    "bool": {
      "must": [{
        "script_score": {
          "query": {
            "multi_match": {
              "query": "word",
              "fields": ["test_field"]
            }
          },
          "script": {
            "source": "return doc.test_field.size();"
          }
        }
      }]
    }
  },
  "explain": true
}
which returns the following:
{
  "took": 39,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_shard": "[test-index][0]",
        "_node": "pJAjqOHJSP2UgyvCe_m_8g",
        "_index": "test-index",
        "_id": "EuLve40BRTvtv3Tg10Vy",
        "_score": 1.0,
        "_source": {
          "test_field": [
            "word",
            "word",
            "word",
            "word",
            "word"
          ]
        },
        "_explanation": {
          "value": 1.0,
          "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='return doc.test_field.size();', options={}, params={}}\"",
          "details": []
        }
      },
      {
        "_shard": "[test-index][0]",
        "_node": "pJAjqOHJSP2UgyvCe_m_8g",
        "_index": "test-index",
        "_id": "E-Lve40BRTvtv3Tg5kVi",
        "_score": 1.0,
        "_source": {
          "test_field": [
            "word",
            "word"
          ]
        },
        "_explanation": {
          "value": 1.0,
          "description": "script score function, computed with script:\"Script{type=inline, lang='painless', idOrCode='return doc.test_field.size();', options={}, params={}}\"",
          "details": []
        }
      }
    ]
  }
}
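Both hits score 1.0, which means doc.test_field.size() evaluated to 1 for each document. That is what one would see if the values the script reads were held as a sorted set of distinct terms per document rather than as the original array. The sketch below only models that assumption in plain Python; doc_values_view is a hypothetical helper, not an Elasticsearch API.

```python
def doc_values_view(source_values):
    """Hypothetical model of what the script sees:
    distinct values only, in sorted order."""
    return sorted(set(source_values))


doc_a = ["word"] * 5  # _source array of the first hit
doc_b = ["word"] * 2  # _source array of the second hit

size_a = len(doc_values_view(doc_a))
size_b = len(doc_values_view(doc_b))

print(size_a, size_b)
```

If this model is right, both sizes come out as 1 regardless of how many times the string is repeated in _source, which is consistent with the identical scores above.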
I see that the _source of each stored document still contains the duplicate strings in the field, so I am guessing that the duplicates are dropped at index time?
Is this the correct behaviour, and if so, is there a way to perform the query I need with the scoring I describe above?