Value of docFreq is wrong (using single shard)

fsommer · June 14, 2019, 11:59am

Hello,

We are using Elasticsearch for a search application and are seeing odd scoring behaviour. I used "explain": true to understand the scoring, and it seems to me that "docFreq" (the number of documents a term occurs in) has the wrong value.

I reduced it to the following example (it's German, but I think you can follow it nevertheless):

I'm adding two documents with the titles:
"Bürokaufmann"
"Industriekaufmann"

I'm using a decompound filter, which works pretty well and produces a number of subwords. If I understand docFreq correctly, they should have the following docFreq values:
"Büro" = 1
"Kaufmann" = 2
"Kauf" = 2
"Mann" = 2
"Industrie" = 1
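To make that expectation concrete, here is a minimal sketch that counts document frequencies by hand. The token lists are an assumption: they simply restate the subwords listed above (lowercased) and are not real analyzer output.

```python
from collections import Counter

# Hypothetical decompounder output per title (assumed from the list above,
# not taken from a real _analyze call).
subwords_by_title = {
    "Bürokaufmann": ["büro", "kaufmann", "kauf", "mann"],
    "Industriekaufmann": ["industrie", "kaufmann", "kauf", "mann"],
}

# docFreq = number of documents that contain the term at least once,
# so each document contributes a term at most once (hence the set).
doc_freq = Counter()
for tokens in subwords_by_title.values():
    doc_freq.update(set(tokens))

for term in ["büro", "kaufmann", "kauf", "mann", "industrie"]:
    print(term, doc_freq[term])
```
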

Thus, "Büro" should get a higher score than "Kaufmann", because "Kaufmann" appears in more documents and is therefore the more "generic" term.

But when I search with explain, it shows the following details for the term "Buro":

..."description": "weight(title:buro in 1) [PerFieldSimilarity], result of:",...

    {
      "value": 0.18232156,
      "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
      "details": [
        {
          "value": 2,
          "description": "docFreq",
          "details": []
        },
        {
          "value": 2,
          "description": "docCount",
          "details": []
        }
      ]
    },


docFreq is 2 instead of 1.
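
As a sanity check, the idf formula quoted in the explain output can be evaluated by hand. This is only a sketch of that one formula (the function name `bm25_idf` is mine, not an Elasticsearch API); it compares the reported docFreq = 2 with the docFreq = 1 I expected.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """idf exactly as written in the explain output above (natural log)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

reported = bm25_idf(doc_freq=2, doc_count=2)  # values shown by explain
expected = bm25_idf(doc_freq=1, doc_count=2)  # what I expected for "buro"

print(f"reported idf: {reported:.8f}")
print(f"expected idf: {expected:.8f}")
```

With docFreq = 2 this gives log(1.2), which matches the 0.18232156 in the explain output; with docFreq = 1 it would be log(2), a noticeably higher idf.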

It does this for every subword - it's always docFreq = docCount.
Thus, when I search for "Bürokaufmann" I'm getting the same score for the title "Bürokaufmann" as for the title "Industriekaufmann"... basically I'm getting the same score for everything.

Could my mapping / index settings contain something odd that produces this error, or am I simply misunderstanding the meaning of docFreq?

Please note that I am using only 1 shard, so the "counts are calculated per shard" thing is hopefully not my problem(?)

I'm thankful for any suggestions and tips!

fsommer · June 17, 2019, 3:54pm

Okay, I think I found out what is happening:

The docFreq is calculated not per subword, but as sum of the docFreqs for all subwords.

Example:
Compound word made of three words A, B and C: "ABC"
Indexed document titles:
"A..."
"B..."
"C..."

Search for the compound word "ABC" and the explanation will look like: "(A in title)... docFreq=3"
But search only for the single word "A" and it will be: "(A in title)... docFreq=1"

The docFreq of a single word is correct, but if you search for the compound-word "ABC" the docFreq for a single subword will be the sum of the docFreq for A, B and C.

This becomes clearer in the explain text of the newer ES version 7; we are still using version 5.

I still don't understand why this is happening, though.
We are using a multi_match of type "cross_fields" - maybe that's the problem.
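
One way to check that guess would be to run the same search twice with explain enabled, once as "cross_fields" and once as "best_fields", and compare the docFreq values. The sketch below only builds the two request bodies; the field name `title` comes from the explain output above, the helper name is made up, and whether the match type really changes docFreq is exactly the open question, not something this confirms.

```python
import json

def explain_search_body(query_text: str, match_type: str) -> dict:
    """Build a _search body with explain enabled (helper name is hypothetical)."""
    return {
        "explain": True,
        "query": {
            "multi_match": {
                "query": query_text,
                "type": match_type,
                "fields": ["title"],  # assumed: the only field being searched
            }
        },
    }

cross_fields_body = explain_search_body("Bürokaufmann", "cross_fields")
best_fields_body = explain_search_body("Bürokaufmann", "best_fields")

# POST each body to the index's _search endpoint and compare the
# "docFreq" entries in the two explanations.
print(json.dumps(cross_fields_body, ensure_ascii=False, indent=2))
print(json.dumps(best_fields_body, ensure_ascii=False, indent=2))
```
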

