Hello,
we are using Elasticsearch for a search application and are seeing weird scoring behaviour. I used `"explain": true` to understand the scoring, and it seems to me that docFreq (the number of documents a term appears in) has the wrong value.
I reduced it to the following example (it's German, but I think you can understand it nevertheless):
I'm adding two documents with the titles:
"Bürokaufmann"
"Industriekaufmann"
I'm using a decompound filter which works pretty well and splits each title into a number of subwords. If I understand docFreq correctly, the subwords should have the following docFreq values:
"Büro" = 1
"Kaufmann" = 2
"Kauf" = 2
"Mann" = 2
"Industrie" = 1
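For context, the decompounding is set up roughly like this (a sketch, not my exact settings: the index name, analyzer name, and word list here are placeholders). Note that `dictionary_decompounder` keeps the original compound token and adds the subwords at the same position:

```
PUT /jobs
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["büro", "kaufmann", "kauf", "mann", "industrie"]
        }
      },
      "analyzer": {
        "german_compound": {
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "german_compound" }
    }
  }
}
```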
Thus, "Büro" should score higher than "Kaufmann", because "Kaufmann" appears in more documents and is therefore a more "generic" term.
But when I search with explain, it shows the following details for the term "büro":
..."description": "weight(title:buro in 1) [PerFieldSimilarity], result of:",...
```
{
  "value": 0.18232156,
  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
  "details": [
    {
      "value": 2,
      "description": "docFreq",
      "details": []
    },
    {
      "value": 2,
      "description": "docCount",
      "details": []
    }
  ]
},
```
docFreq is 2 instead of 1.
It does this for every subword - it's always docFreq = docCount.
Thus, when I search for "Bürokaufmann", I get the same score for the title "Bürokaufmann" as for the title "Industriekaufmann"... basically I'm getting the same score for everything.
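Plugging the reported values into the IDF formula from the explain output confirms where the flat scores come from (a quick sanity check in Python):

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF exactly as shown in the explain output:
    log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# What explain reports for "büro" (docFreq = 2, docCount = 2):
print(bm25_idf(2, 2))  # ≈ 0.18232156, matching the explain output above

# What I would expect with docFreq = 1:
print(bm25_idf(1, 2))  # ≈ 0.6931, noticeably higher
```

So with docFreq = docCount for every subword, all terms get the same (minimal) IDF and nothing can differentiate the titles.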
Could my mapping / index settings contain something weird that produces this error, or am I simply misunderstanding the meaning of docFreq?
Please note that I am using only 1 shard, so the "counts are calculated per shard" issue is hopefully not my problem(?)
I'm thankful for any suggestions and tips!