Value of docFreq is wrong (using single shard)


We are using Elasticsearch for a search application and are seeing odd scoring behaviour. I used "explain": true to understand the scoring, and it seems to me that "docFreq" (the number of documents that contain a term) has the wrong value.

I reduced it to the following example (it's German, but I think you can understand it nevertheless):

I'm adding two documents with the titles:

I'm using a decompound filter, which works pretty well and produces a number of subwords. If I understand docFreq correctly, they should have the following docFreq values:
"Büro" = 1
"Kaufmann" = 2
"Kauf" = 2
"Mann" = 2
"Industrie" = 1

Thus, "Büro" should get a higher score than "Kaufmann", because "Kaufmann" appears in more documents and is therefore a more generic term.
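To make the expected counts concrete, here is a minimal sketch of how docFreq would be counted per subword. The token lists are an assumption: they are what I would expect the decompounder to emit for the two titles mentioned further down ("Bürokaufmann" and "Industriekaufmann"), not actual analyzer output.

```python
from collections import Counter

# Assumed decompounder output per title (hypothetical token lists,
# not taken from a real _analyze call).
docs = {
    "Bürokaufmann": ["Büro", "Kaufmann", "Kauf", "Mann"],
    "Industriekaufmann": ["Industrie", "Kaufmann", "Kauf", "Mann"],
}


def doc_freq(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for tokens in tokenized_docs.values():
        # Use a set so a term is counted once per document.
        counts.update(set(tokens))
    return counts


df = doc_freq(docs)
for term in ["Büro", "Kaufmann", "Kauf", "Mann", "Industrie"]:
    print(term, df[term])
```

Under these assumptions the sketch yields 1 for "Büro" and "Industrie" and 2 for the other three subwords, which is the list above.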

But when I search with explain, it shows the following details for the term "Buro":

..."description": "weight(title:buro in 1) [PerFieldSimilarity], result of:",...

    "value": 0.18232156,
    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
    "details": [
        {
            "value": 2,
            "description": "docFreq",
            "details": []
        },
        {
            "value": 2,
            "description": "docCount",
            "details": []
        }
    ]


docFreq is 2 instead of 1.
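As a quick check, the idf formula from the explain output can be evaluated by hand. The sketch below only re-implements the formula quoted in the description string; the function name `bm25_idf` is my own and not an Elasticsearch API.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """idf as quoted in the explain output:
    log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Values reported by explain: docFreq = 2, docCount = 2.
reported = bm25_idf(2, 2)
# Values I expected for "Büro": docFreq = 1, docCount = 2.
expected = bm25_idf(1, 2)

print(f"reported: {reported:.8f}")
print(f"expected: {expected:.8f}")
```

With docFreq = 2 the formula gives log(1.2), which matches the 0.18232156 in the output. With docFreq = 1 it would give log(2), roughly 0.693, so the term would weigh considerably more.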

It does this for every subword - it's always docFreq = docCount.
Thus, when I search for "Bürokaufmann" I'm getting the same score for the title "Bürokaufmann" as for the title "Industriekaufmann"... basically I'm getting the same score for everything.

Could my mapping or index settings contain something odd that produces this error, or am I simply misunderstanding the meaning of docFreq?

Please note that I am using only one shard, so the "counts are calculated per shard" issue is hopefully not my problem(?)

I'm thankful for any suggestions and tips!

Okay, I think I found out what is happening:

The docFreq is calculated not per subword, but as the sum of the docFreqs of all subwords.

Compound word made of three words A, B and C: "ABC"
Indexed document titles:

Searching for the compound word "ABC" gives an explanation like: "(A in title)... docFreq=3"
But searching only for the single word "A" gives: "(A in title)... docFreq=1"

The docFreq of a single word is correct, but if you search for the compound-word "ABC" the docFreq for a single subword will be the sum of the docFreq for A, B and C.

This becomes clearer in the explain output of the newer ES version 7; we are still using 5.

I still don't understand why this is happening, though.
We are using a multi_match of type "cross_fields" - maybe that's the problem.
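One way to test the cross_fields suspicion would be to run the same search with two different multi_match types and compare the docFreq values in the explain output. The sketch below only builds the two request bodies; the helper name `explain_query` and the field list `["title"]` are assumptions for illustration, and our real query uses more options.

```python
import json


def explain_query(text, match_type, fields):
    """Build a search body with explain enabled for a multi_match query."""
    return {
        "explain": True,
        "query": {
            "multi_match": {
                "query": text,
                "type": match_type,
                "fields": fields,
            }
        },
    }


# Same text and fields, only the multi_match type differs.
cross = explain_query("Bürokaufmann", "cross_fields", ["title"])
best = explain_query("Bürokaufmann", "best_fields", ["title"])

print(json.dumps(cross, ensure_ascii=False, indent=2))
print(json.dumps(best, ensure_ascii=False, indent=2))
```

If the docFreq values differ between the two responses, that would point to cross_fields; if they are the same, the cause is probably somewhere else, for example in how the subwords are analyzed.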
