Understanding doc and docCount values in explain response

#1

Here's the sample response after running a query with `explain` set to `true`:

``````
[0] => Array
(
[value] => 24.375515
[description] => score(doc=1115115,freq=1.0 = termFreq=1.0), product of:
[details] => Array
(
[0] => Array
(
[value] => 3
[description] => boost
[details] => Array
(
)

)

[1] => Array
(
[value] => 7.09822
[description] => idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
[details] => Array
(
[0] => Array
(
[value] => 976
[description] => docFreq
[details] => Array
(
)

)

[1] => Array
(
[value] => 1181380
[description] => docCount
[details] => Array
(
)

)

)

)

[2] => Array
(
[value] => 1.1446774
[description] => tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
[details] => Array
(
[0] => Array
(
[value] => 1
[description] => termFreq=1.0
[details] => Array
(
)

)

[1] => Array
(
[value] => 1.2
[description] => parameter k1
[details] => Array
(
)

)

[2] => Array
(
[value] => 0.75
[description] => parameter b
[details] => Array
(
)

)

[3] => Array
(
[value] => 5.788349
[description] => avgFieldLength
[details] => Array
(
)

)

[4] => Array
(
[value] => 4
[description] => fieldLength
[details] => Array
(
)

)

)

)

)

)
``````

Within this subset, I am trying to understand:

1. How are `doc` and `docCount` calculated?
2. What's the difference between these two fields?
3. What impact does the `docs.deleted` field have on the calculation, given that the values change if the index contains deleted documents?

Any help will be appreciated!

(Simon Willnauer) #2

hey, lemme try to explain:

• `doc` is the internal docID. It's irrelevant to you; it's just the Nth document in a segment.
• `docCount` is the total number of documents that have at least one term in the field. For simplicity, you can think of it as the number of docs in your index.
• `docs.deleted` is the number of documents marked as deleted. We still take them into account when we score, i.e. `docCount` will include deleted docs.
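To see how `docFreq` and `docCount` feed into the score, here is a rough Python sketch that plugs the numbers from the explain response in #1 into the two formulas printed in its `description` fields. The function names `bm25_idf` and `bm25_tf_norm` are made up for illustration, and `log` is assumed to be the natural logarithm. `doc` plays no part in the arithmetic.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """idf as printed in the explain output (natural log assumed)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm as printed in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

# Values copied from the explain response in #1.
boost = 3
idf = bm25_idf(doc_freq=976, doc_count=1181380)
tf_norm = bm25_tf_norm(
    freq=1.0, k1=1.2, b=0.75, field_length=4, avg_field_length=5.788349
)

# The top-level score is described as a "product of" its three details.
score = boost * idf * tf_norm

print(idf, tf_norm, score)
```

The results should land very close to the reported `7.09822`, `1.1446774` and `24.375515`; small differences in the last digits are expected because the values in the explain output look like single-precision floats.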

hope this helps

#3

It does help! Thanks

1. I am assuming then that values for `docCount` for each field will not change for a static index, irrespective of the query being executed. Is that a correct assumption?
2. Is there a way to pull out `docCount` per field, or is it only visible through the Explain API?
3. Does `docFreq` mean the number of documents in which the term appears in that field (i.e. the matching documents)? If so, for a simple match query like the following,
``````
POST /development_en/catalog/_search?search_type=dfs_query_then_fetch&preference=local_testing&filter_path=hits.total,hits.hits._explanation
{
  "size": 1,
  "explain": true,
  "query": {
    "match": {
      "product.name": "glove"
    }
  }
}
``````
shouldn't `hits.total` match `docFreq`? I get the following response:
``````
{
"hits": {
"total": 660,
"hits": [
{
"_explanation": {
"value": 6.9444885,
"description": "sum of:",
"details": [
{
"value": 6.9444885,
"description": "weight(product.name:glove in 690378) [PerFieldSimilarity], result of:",
"details": [
{
"value": 6.9444885,
"description": "score(doc=690378,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 5.3614783,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 5607,
"description": "docFreq",
"details": []
},
{
"value": 1194619,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.2952563,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 5.7816,
"description": "avgFieldLength",
"details": []
},
{
"value": 2.56,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0,
"description": "match on required clause, product of:",
"details": [
{
"value": 0,
"description": "# clause",
"details": []
},
{
"value": 1,
"description": "_type:catalog, product of:",
"details": [
{
"value": 1,
"description": "boost",
"details": []
},
{
"value": 1,
"description": "queryNorm",
"details": []
}
]
}
]
}
]
}
}
]
}
}
``````

Thanks again for your time. I am trying to wrap my head around some of the numbers that Elasticsearch uses!

(Simon Willnauer) #4

1. Yes, that is correct.
2. Not at this point.
3. I think the confusion comes from `docFreq` being per shard, while `hits.total` counts hits across all shards.

hope that makes sense
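Since `docCount` is only visible through explain for now, one workaround is to pull it out of the `_explanation` tree on the client side. Below is a minimal Python sketch, assuming the response has already been parsed into a dict; `collect_stats` is a hypothetical helper, not part of any Elasticsearch client, and `sample` is a trimmed copy of the response in #3.

```python
def collect_stats(explanation, wanted=("docFreq", "docCount")):
    """Recursively gather (description, value) pairs for the wanted nodes."""
    found = []
    if explanation.get("description") in wanted:
        found.append((explanation["description"], explanation["value"]))
    for child in explanation.get("details", []):
        found.extend(collect_stats(child, wanted))
    return found

# Trimmed copy of the `_explanation` tree from the response in #3.
sample = {
    "value": 6.9444885,
    "description": "sum of:",
    "details": [
        {
            "value": 6.9444885,
            "description": "weight(product.name:glove in 690378) [PerFieldSimilarity], result of:",
            "details": [
                {
                    "value": 5.3614783,
                    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                    "details": [
                        {"value": 5607, "description": "docFreq", "details": []},
                        {"value": 1194619, "description": "docCount", "details": []},
                    ],
                }
            ],
        }
    ],
}

stats = collect_stats(sample)
print(stats)
```

Matching on the exact `description` strings is brittle, since those strings are just what this particular response happens to contain, so treat this as a debugging aid rather than something to build on.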

