Understanding doc and docCount values in explain response


#1

Here's a sample response after running a query with explain set to true:

[0] => Array
    (
        [value] => 24.375515
        [description] => score(doc=1115115,freq=1.0 = termFreq=1.0), product of:
        [details] => Array
            (
                [0] => Array
                    (
                        [value] => 3
                        [description] => boost
                        [details] => Array
                            (
                            )

                    )

                [1] => Array
                    (
                        [value] => 7.09822
                        [description] => idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
                        [details] => Array
                            (
                                [0] => Array
                                    (
                                        [value] => 976
                                        [description] => docFreq
                                        [details] => Array
                                            (
                                            )

                                    )

                                [1] => Array
                                    (
                                        [value] => 1181380
                                        [description] => docCount
                                        [details] => Array
                                            (
                                            )

                                    )

                            )

                    )

                [2] => Array
                    (
                        [value] => 1.1446774
                        [description] => tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
                        [details] => Array
                            (
                                [0] => Array
                                    (
                                        [value] => 1
                                        [description] => termFreq=1.0
                                        [details] => Array
                                            (
                                            )

                                    )

                                [1] => Array
                                    (
                                        [value] => 1.2
                                        [description] => parameter k1
                                        [details] => Array
                                            (
                                            )

                                    )

                                [2] => Array
                                    (
                                        [value] => 0.75
                                        [description] => parameter b
                                        [details] => Array
                                            (
                                            )

                                    )

                                [3] => Array
                                    (
                                        [value] => 5.788349
                                        [description] => avgFieldLength
                                        [details] => Array
                                            (
                                            )

                                    )

                                [4] => Array
                                    (
                                        [value] => 4
                                        [description] => fieldLength
                                        [details] => Array
                                            (
                                            )

                                    )

                            )

                    )

            )

    )

Within this subset, I am trying to understand:

  1. How are doc and docCount calculated?
  2. What's the difference between these two fields?
  3. What impact does the docs.deleted field have on the calculation, given that the values change if the index contains deleted documents?

Any help will be appreciated!


(Simon Willnauer) #2

hey, lemme try to explain:

  • doc is the internal docID. It's totally irrelevant to you; it's just the Nth document in a segment.
  • docCount is the total number of documents that have at least one term in the field. For simplicity, you can think of it as the number of docs in your index.
  • docs.deleted is the number of documents marked as deleted. We still take them into account when we score, i.e. docCount will include deleted docs.

hope this helps
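
To make the numbers in the explain output from #1 concrete, here is a minimal Python sketch that plugs the reported leaf values into the two formulas shown in the description strings. The function names bm25_idf and bm25_tf_norm are made up for this illustration and are not part of any Elasticsearch or Lucene API; the inputs are copied from the response above.

```python
import math


def bm25_idf(doc_freq, doc_count):
    # idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


# Leaf values copied from the explain output in post #1.
boost = 3
idf = bm25_idf(doc_freq=976, doc_count=1181380)
tf_norm = bm25_tf_norm(
    freq=1.0, k1=1.2, b=0.75, field_length=4, avg_field_length=5.788349
)

# The top-level description says the score is the product of the three parts.
score = boost * idf * tf_norm
print(idf, tf_norm, score)
```

The recomputed values should agree with the reported 7.09822, 1.1446774 and 24.375515 up to small floating-point differences. That suggests docFreq and docCount are simply inputs to the idf term, while doc only identifies the document being explained.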


#3

It does help! Thanks :)

A couple of follow-up questions:

  1. I am assuming that the docCount value for each field will not change for a static index, irrespective of the query being executed. Is that a correct assumption?
  2. Is there a way to pull out docCount per field, or is it only accessible/visible through the explain API?
  3. Does docFreq mean the number of documents in which the term appears in that field (i.e. the number of successful hits)? If so, for a simple match query like the following,
    POST /development_en/catalog/_search?search_type=dfs_query_then_fetch&preference=local_testing&filter_path=hits.total,hits.hits._explanation
    {
      "size": 1,
      "explain": true,
      "query": {
        "match": {
          "product.name": "glove"
        }
      }
    }
    
    shouldn't hits.total match docFreq? I get the following response:
     {
       "hits": {
         "total": 660,
         "hits": [
           {
             "_explanation": {
               "value": 6.9444885,
               "description": "sum of:",
               "details": [
                 {
                   "value": 6.9444885,
                   "description": "weight(product.name:glove in 690378) [PerFieldSimilarity], result of:",
                   "details": [
                     {
                       "value": 6.9444885,
                       "description": "score(doc=690378,freq=1.0 = termFreq=1.0\n), product of:",
                       "details": [
                         {
                           "value": 5.3614783,
                           "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                           "details": [
                             {
                               "value": 5607,
                               "description": "docFreq",
                               "details": []
                             },
                             {
                               "value": 1194619,
                               "description": "docCount",
                               "details": []
                             }
                           ]
                         },
                         {
                           "value": 1.2952563,
                           "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                           "details": [
                             {
                               "value": 1,
                               "description": "termFreq=1.0",
                               "details": []
                             },
                             {
                               "value": 1.2,
                               "description": "parameter k1",
                               "details": []
                             },
                             {
                               "value": 0.75,
                               "description": "parameter b",
                               "details": []
                             },
                             {
                               "value": 5.7816,
                               "description": "avgFieldLength",
                               "details": []
                             },
                             {
                               "value": 2.56,
                               "description": "fieldLength",
                               "details": []
                             }
                           ]
                         }
                       ]
                     }
                   ]
                 },
                 {
                   "value": 0,
                   "description": "match on required clause, product of:",
                   "details": [
                     {
                       "value": 0,
                       "description": "# clause",
                       "details": []
                     },
                     {
                       "value": 1,
                       "description": "_type:catalog, product of:",
                       "details": [
                         {
                           "value": 1,
                           "description": "boost",
                           "details": []
                         },
                         {
                           "value": 1,
                           "description": "queryNorm",
                           "details": []
                         }
                       ]
                     }
                   ]
                 }
               ]
             }
           }
         ]
       }
     }
    

Thanks again for your time. I am trying to wrap my head around some of the numbers that Elasticsearch uses!


(Simon Willnauer) #4

  1. Yes, that is correct.
  2. Not at this point.
  3. I think the confusion comes from docFreq being per shard, while hits.total is the number of hits across shards.

hope that makes sense
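
Since docCount is only visible through explain at this point, one option is to pull it out of the explain response on the client side. The sketch below is an assumption about how that could look, not an official API: collect_stats is a hypothetical helper that walks the nested details arrays and returns the first docFreq and docCount it finds. The sample tree is a trimmed copy of the response from #3, with the long description strings shortened.

```python
import math


def collect_stats(explanation, wanted=("docFreq", "docCount")):
    """Walk an explain tree depth-first and return {description: value}
    for the first node found with each wanted description."""
    found = {}
    if explanation.get("description") in wanted:
        found[explanation["description"]] = explanation["value"]
    for child in explanation.get("details", []):
        for name, value in collect_stats(child, wanted).items():
            found.setdefault(name, value)  # keep the first occurrence only
    return found


# Trimmed copy of the _explanation from post #3 (descriptions shortened).
explanation = {
    "value": 6.9444885,
    "description": "sum of:",
    "details": [
        {
            "value": 6.9444885,
            "description": "weight(product.name:glove in 690378) ...",
            "details": [
                {
                    "value": 5.3614783,
                    "description": "idf, computed as ...",
                    "details": [
                        {"value": 5607, "description": "docFreq", "details": []},
                        {"value": 1194619, "description": "docCount", "details": []},
                    ],
                }
            ],
        }
    ],
}

stats = collect_stats(explanation)

# Recompute idf from the extracted values as a sanity check.
idf = math.log(
    1 + (stats["docCount"] - stats["docFreq"] + 0.5) / (stats["docFreq"] + 0.5)
)
print(stats, idf)
```

For a query with several terms, the tree contains one idf node per term, so a real helper would probably need to collect a list per description instead of keeping only the first match.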