Hi,
I'm relatively new to Elasticsearch, so I'm probably missing something trivial here, but I'm having issues with the relevancy score of the search results when it comes to optional fields in documents. Consider the following example:
Test data:
DELETE /my-index
PUT /my-index
POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
Search Query:
If I run a search query similar to the one below:
GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}
Expectation
The end-user would expect Documents #9 and #10 to score higher than the others, because they contain both words of the search query in their optional_field.
Reality
Document #1 scores higher than #10, even though it contains only one of the two words of the search query. This is the opposite of what end-users most likely expect.
A closer look at _explain
Here are the _explain results of running the same search query for Document #1:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "1",
"matched" : true,
"explanation" : {
"value" : 1.4816045,
"description" : "max of:",
"details" : [
{
"value" : 1.4816045,
"description" : "sum of:",
"details" : [
{
"value" : 1.4816045,
"description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 1.4816045,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 1.4816046,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 10,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
And here are the _explain results of running the same search query for Document #10:
{
"_index" : "my-index",
"_type" : "_doc",
"_id" : "10",
"matched" : true,
"explanation" : {
"value" : 0.36464313,
"description" : "max of:",
"details" : [
{
"value" : 0.36464313,
"description" : "sum of:",
"details" : [
{
"value" : 0.18232156,
"description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
},
{
"value" : 0.18232156,
"description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.18232156,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 0.18232156,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 2,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 2,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.45454544,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
As you can see, Document #10 scores worse, mainly because of its lower IDF value (0.18232156). Looking closely, this is because the IDF computation uses N, the total number of documents with the field (2), instead of the total number of documents in the index (10).
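To double-check my reading of the _explain output, here is a minimal Python sketch that reproduces the arithmetic using the formulas and values printed above. The function names are my own, and the last case (N = 10 for optional_field) is purely hypothetical: it is what I would like to happen, not something Elasticsearch computes by default.

```python
import math

def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)), as printed by _explain
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def bm25_term_score(n, N, freq, dl, avgdl, boost=2.2):
    # score, computed as boost * idf * tf
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)

# Document #1: one matching term in required_field (n=2, N=10).
doc1 = bm25_term_score(n=2, N=10, freq=1, dl=1, avgdl=1)

# Document #10: two matching terms in optional_field (n=2, N=2 for each).
doc10 = 2 * bm25_term_score(n=2, N=2, freq=1, dl=2, avgdl=2)

# Hypothetical: the same two terms if N were the index-wide count (10).
# This is an assumption for illustration, not actual Elasticsearch behaviour.
doc10_index_wide = 2 * bm25_term_score(n=2, N=10, freq=1, dl=2, avgdl=2)

print(f"doc1             = {doc1:.7f}")
print(f"doc10            = {doc10:.7f}")
print(f"doc10 (N = 10)   = {doc10_index_wide:.7f}")
```

The first two values agree with the scores in the _explain output (about 1.4816 and 0.3646). With the index-wide N, Document #10 would score roughly twice as high as Document #1, which is the ordering I was expecting.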
Question
Is there any way to force the multi_match query to consider all the documents in the index (instead of only those that contain the field) when computing the IDF value for an optional field, so that the relevance score is closer to what end-users expect?
Or alternatively, is there a better way to write the search query, so I get the expected results?
Any help would be greatly appreciated. Thanks.
Regards,
Kaykanloo