What do "docCount" and "docFreq" mean in the Explain API?

Masanori_Ohnishi · January 11, 2019, 6:42am

Here are a sample mapping, the requests that register the documents, and the search query.

mapping

curl -X PUT "es:9200/english1" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "header" : {
          "type" : "text"
        },
        "body" : {
          "type" : "text"
        }
      }
    }
  }
}
'

register

curl -X PUT "es:9200/english1/_doc/1?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "something special",
  "body": "I am John"
}
'

curl -X PUT "es:9200/english1/_doc/2?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "something better",
  "body": "You are Chris"
}
'

curl -X PUT "es:9200/english1/_doc/3?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "anything hot",
  "body": "This is a cup"
}
'

curl -X PUT "es:9200/english1/_doc/4?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "anything cold",
  "body": "That is a glass"
}
'

search

curl -XGET 'es:9200/english1/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "simple_query_string": {
      "query": "something",
      "fields": ["header", "body"]
    }
  },
  "explain": true
}'

result

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_shard" : "[english1][2]",
        "_node" : "sN3QHj7oRF-rgbBbs4U6lw",
        "_index" : "english1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "header" : "something better",
          "body" : "You are Chris"
        },
        "_explanation" : {
          "value" : 0.6931472,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.6931472,
              "description" : "weight(header:something in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.6931472,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.6931472,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
...
      },
      {
        "_shard" : "[english1][3]",
        "_node" : "sN3QHj7oRF-rgbBbs4U6lw",
        "_index" : "english1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "header" : "something special",
          "body" : "I am John"
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "weight(header:something in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
...
}

According to this Q&A [Understanding doc and docCount values in explain response], docCount is the number of documents in my index, so I expected docCount to be 4. In this response, however, it is 2 for one hit and 1 for the other.

I also can't understand docFreq.

Please help me...

s1monw · January 11, 2019, 7:03am

All these statistics are per shard, not per index. Use a single shard instead of 5 and your stats will be accurate.
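
To make the per-shard numbers concrete, here is a small sketch that plugs the docCount and docFreq values from the explain output into the idf formula it prints. On each shard, docCount is the number of documents that have a value for the field, and docFreq is the number of those documents that contain the term. The `idf` shell function is a helper name made up for this illustration, and it assumes `awk` is available.

```shell
# idf DOC_COUNT DOC_FREQ
# Evaluates the formula shown in the explain output:
#   log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
# "idf" is a hypothetical helper name, not part of Elasticsearch.
idf() {
  awk -v n="$1" -v df="$2" \
    'BEGIN { printf "%.7f\n", log(1 + (n - df + 0.5) / (df + 0.5)) }'
}

idf 2 1   # shard [english1][2]: docCount=2, docFreq=1 -> 0.6931472
idf 1 1   # shard [english1][3]: docCount=1, docFreq=1 -> 0.2876821
idf 4 2   # one shard holding all four documents: docCount=4, docFreq=2
```

The first two results match the idf values in the response above, which is why the two hits score differently even though both contain "something" once in a two-word header. With all four documents on one shard, both hits would be scored with docCount=4 and docFreq=2 and so get the same idf.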

Masanori_Ohnishi · January 11, 2019, 10:52am

Thank you very much!
The solution works!!

A couple of follow-up questions:

According to the blog (https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster), the size limit of a single shard seems to be 50GB.
If the data size exceeds 50GB, how can I increase the number of shards?
Also, is it common to start with a single shard when building a product?

Forgive me for stealing your time again...

s1monw · January 11, 2019, 12:11pm

You are not stealing anybody's time.

There are some limits, but size is not one of them. That said, a massive single shard will at some point not give you the performance you need. You can use more than one shard, and you should if you have enough data. The number of shards is fixed at index creation time, but you can still use the _split API if you want more shards later.

Whether starting with a single shard is common depends on how much data you have, but it's not uncommon and it's a good place to start.

I take from your original question that you need the statistics to be accurate? Why is this the case? Can you explain why you need accurate docCount values?
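
A rough sketch of what using the _split API could look like for the `english1` index from this thread. The target name `english1_split` and the target shard count of 2 are made up for illustration. The source index has to be blocked for writes before it can be split, and on 6.x the index must also have been created with `index.number_of_routing_shards`, so check the _split documentation for your version before relying on this.

```shell
# Block writes on the source index; _split requires this.
curl -X PUT "es:9200/english1/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.blocks.write": true
  }
}
'

# Split into a new index with more primary shards.
# "english1_split" and the shard count 2 are illustrative assumptions.
curl -X POST "es:9200/english1/_split/english1_split" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 2
  }
}
'
```

The result is a separate index, so searches have to be pointed at the new name (or at an alias) afterwards.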

Masanori_Ohnishi · January 15, 2019, 5:15am

Thanks a lot!!
(Sorry for the late reply...)

I take from your original question that you need the statistics to be accurate? Why is this the case? Can you explain why you need accurate docCount values?

I currently have only a small amount of data, so I need the statistics to be accurate.
Since the data will grow in the future, I also wanted to know how to cope when it does.

In conclusion, I will use a single shard and use the _split API if necessary.
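
A minimal sketch of creating the index with a single primary shard, reusing the mapping from the original post. Setting `index.number_of_routing_shards` is an assumption added here so that the index can still be split later on 6.x; the value 8 is only an example, and the index must not already exist when this request is sent.

```shell
# Create english1 with one primary shard so that docCount and docFreq
# are computed over every document in the index.
# number_of_routing_shards: 8 is an illustrative value, not a recommendation.
curl -X PUT "es:9200/english1" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_routing_shards": 8
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "header": { "type": "text" },
        "body": { "type": "text" }
      }
    }
  }
}
'
```

After re-registering the four documents, the same explain query should report docCount 4 and docFreq 2 for "something" in the header field.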

s1monw · January 15, 2019, 8:37am

I agree, that is a good solution. Once you are beyond one shard, you can still use the DFS Query Then Fetch search type (dfs_query_then_fetch) to get accurate stats. That requires an additional round trip and might be overkill. I recommend reading the linked article.

simon
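
For reference, a sketch of the original search re-run with the `dfs_query_then_fetch` search type, assuming the five-shard `english1` index from the first post. The extra round trip collects term statistics from all shards before scoring, so both hits should then be scored with the same index-wide docCount and docFreq.

```shell
# Same query as in the original post, with search_type=dfs_query_then_fetch
# so that term statistics are gathered across all shards before scoring.
curl -XGET 'es:9200/english1/_search?search_type=dfs_query_then_fetch&pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "simple_query_string": {
      "query": "something",
      "fields": ["header", "body"]
    }
  },
  "explain": true
}'
```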
