What do "docCount" and "docFreq" mean in the Explain API?


(Masanori Ohnishi) #1

Here are my sample mapping, indexing requests, and search query.

mapping

curl -X PUT "es:9200/english1" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "header" : {
          "type" : "text"
        },
        "body" : {
          "type" : "text"
        }
      }
    }
  }
}
'

register

curl -X PUT "es:9200/english1/_doc/1?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "something special",
  "body": "I am John"
}
'

curl -X PUT "es:9200/english1/_doc/2?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "something better",
  "body": "You are Chris"
}
'

curl -X PUT "es:9200/english1/_doc/3?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "anything hot",
  "body": "This is a cup"
}
'

curl -X PUT "es:9200/english1/_doc/4?refresh" -H 'Content-Type: application/json' -d'
{
  "header": "anything cold",
  "body": "That is a glass"
}
'

search

curl -XGET 'es:9200/english1/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "simple_query_string": {
      "query": "something",
      "fields": ["header", "body"]
    }
  },
  "explain": true
}'

result

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_shard" : "[english1][2]",
        "_node" : "sN3QHj7oRF-rgbBbs4U6lw",
        "_index" : "english1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "header" : "something better",
          "body" : "You are Chris"
        },
        "_explanation" : {
          "value" : 0.6931472,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.6931472,
              "description" : "weight(header:something in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.6931472,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.6931472,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
...
      },
      {
        "_shard" : "[english1][3]",
        "_node" : "sN3QHj7oRF-rgbBbs4U6lw",
        "_index" : "english1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "header" : "something special",
          "body" : "I am John"
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "weight(header:something in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
...
}

According to this Q&A [Understanding doc and docCount values in explain response],
docCount is the number of docs in my index, so I assumed docCount would be 4 (but that is not what this response shows).

I also can't understand docFreq.

Please help me...


(Simon Willnauer) #2

All these statistics are per shard, not per index. Use a single shard instead of 5 and your stats will be accurate.
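
To see how the per-shard numbers produce the two scores in the explain output above, here is a small sketch that evaluates the idf formula exactly as the explain description prints it. The helper name idf is made up for this illustration. The last call assumes that with a single shard all 4 documents count toward docCount for header and the 2 containing "something" count toward docFreq.

```shell
#!/bin/sh
# idf as shown in the explain output:
#   log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
# $1 = docCount (docs on the shard that have the field)
# $2 = docFreq  (docs on the shard whose field contains the term)
# "idf" is a hypothetical helper name, not part of any Elasticsearch tool.
idf() {
  awk -v n="$1" -v df="$2" \
    'BEGIN { printf "%.7f\n", log(1 + (n - df + 0.5) / (df + 0.5)) }'
}

idf 2 1   # shard [english1][2]: prints 0.6931472
idf 1 1   # shard [english1][3]: prints 0.2876821
idf 4 2   # assumed single-shard case: prints 0.6931472
```

Since tfNorm is 1.0 for both hits, the idf value is the whole score, which is why _id 2 and _id 1 score differently even though they match the same term in the same way.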


(Masanori Ohnishi) #3

Thank you very much!
The solution works!!

A couple of follow-up questions,

Forgive me for stealing your time again....


(Simon Willnauer) #4

You are not stealing anybody's time.

There are some limits, but size is not the limit. That said, massive single shards will at some point not give you the performance you need. You can use more than one shard, and you should if you have enough data. The number of shards is determined at index creation time, but you can still use the _split API if you wanna use more shards.

It depends on how much data you have, yet it's not uncommon and a good place to start.

I take from your original question that you need the statistics to be accurate? Why is this the case? Can you explain why you need accurate docCount values?
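
As a rough sketch of what this could look like: create the index with one shard, and split it later once the data has grown. The index names english1_single and english1_split are made up for this example, and the number_of_routing_shards value of 4 is an assumption (6.x releases need it set at creation time for a later split; newer releases pick a default). The source index has to be write-blocked before the split.

```shell
# Create a single-shard index (hypothetical name: english1_single).
# number_of_routing_shards is assumed here so the index can later be
# split into 2 or 4 shards.
curl -X PUT "es:9200/english1_single" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_routing_shards": 4
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "header": { "type": "text" },
        "body": { "type": "text" }
      }
    }
  }
}'

# Later: block writes on the source index, which the split requires.
curl -X PUT "es:9200/english1_single/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.blocks.write": true
  }
}'

# Split into a new index with 2 shards (hypothetical name: english1_split).
curl -X POST "es:9200/english1_single/_split/english1_split" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 2
  }
}'
```

After the split, scoring statistics are per shard again, so the explain output of the new index can differ from the single-shard one.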


(Masanori Ohnishi) #7

Thanks a lot!!
(Sorry for the late reply...)

"I take from your original question that you need the statistics to be accurate? Why is this the case? Can you explain why you need accurate docCount values?"

Currently I deal with a small amount of data, so I need the statistics to be accurate.
However, since the data will grow in the future, I wanted to know how to cope when it does.

In conclusion, I will use a single shard, and use the _split API if necessary.


(Simon Willnauer) #8

I agree, that is a good solution. Once you are beyond one shard you can still use the dfs_query_then_fetch search type to get accurate stats. That requires an additional round trip and might be overkill. I recommend reading the linked article.
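
For illustration, the search from the first post could be rerun with that search type set as a URL parameter. This is only a sketch against the original 5-shard english1 index. With term statistics gathered across shards first, the explain output should then report index-wide docFreq and docCount rather than per-shard values.

```shell
# Same query as in the first post, with search_type=dfs_query_then_fetch
# so term statistics are collected from all shards before scoring.
curl -XGET 'es:9200/english1/_search?search_type=dfs_query_then_fetch&pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "simple_query_string": {
      "query": "something",
      "fields": ["header", "body"]
    }
  },
  "explain": true
}'
```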

simon

