Per Shard Statistics


(Rongduan Zhu) #1

Hi guys, I'm new to Elastic. Recently I've been trying to understand the scoring of Elastic using the Explain API.

I've read that for a normal match query, the statistics for each result are calculated on a per-shard basis.

Since statistics are computed per shard, the idf score for the same term can differ between documents in different shards.

My question is then:
Wouldn't this affect the correctness (not sure if that's the right term) of the results? I would expect the idf score for a term to be the same for all matching documents.

Thanks


(Colin Goodheart-Smithe) #2

Yes, you are right that the IDF score will be different for the same term in different shards. For the standard search type, QUERY_THEN_FETCH, we work under the assumption that the term statistics for a specific shard are representative of the term statistics of the entire index (across all shards), i.e. that the IDF scores for a particular term will not vary significantly between shards.
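To make that assumption concrete, here is a rough sketch in Python. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)); the shard numbers are made up for illustration and `classic_idf` and `idf_spread` are hypothetical helper names, not Elasticsearch APIs.

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed formula: classic Lucene TF-IDF, 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def idf_spread(shards):
    # Difference between the largest and smallest per-shard idf.
    idfs = [classic_idf(doc_freq, max_docs) for doc_freq, max_docs in shards]
    return max(idfs) - min(idfs)

# Hypothetical (docFreq, maxDocs) pairs for one term on two shards.
evenly_spread = [(104, 1000), (106, 1000)]  # term spread roughly evenly
skewed = [(10, 1000), (200, 1000)]          # term concentrated on one shard

print("spread when evenly distributed:", idf_spread(evenly_spread))
print("spread when skewed:", idf_spread(skewed))
```

When the term is spread evenly the per-shard idf values are nearly identical, so scoring with local statistics barely changes the ranking; with a skewed distribution the gap becomes large.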

This usually works well, especially with default routing, which spreads the documents between shards in a pseudo-random fashion. (It actually takes the modulo of a hash of the document's id, so it is deterministic, but the main thing here is that the routing of a document is independent of the document's contents.)
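As a minimal sketch of that routing idea: hash the document id, then take the result modulo the number of primary shards. Elasticsearch uses its own hash function internally; `zlib.crc32` is only a stand-in here, and `pick_shard` is a hypothetical name, so the shard numbers this produces will not match a real cluster.

```python
import zlib

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    # Stand-in hash (assumption): Elasticsearch does not use crc32.
    hashed = zlib.crc32(doc_id.encode("utf-8"))
    # Deterministic: the same id always maps to the same shard.
    return hashed % num_primary_shards

# The shard depends only on the id, never on the document's contents.
for doc_id in ["1", "2", "3"]:
    print(doc_id, "->", pick_shard(doc_id, 5))
```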

For situations where this assumption does not hold (usually when using custom routing), Elasticsearch has a specialised search type called DFS_QUERY_THEN_FETCH. It makes an extra initial visit to each shard to gather the required term frequencies, so that it can compute more accurate scores in the query phase using the term frequencies from all shards. This accuracy comes at a performance cost: we must make an extra request to gather these term statistics before actually performing the search, and we must also transfer the combined statistics back to the shards for the search itself.
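Conceptually, the extra round trip looks something like the sketch below: collect (docFreq, maxDocs) from every shard, sum them, and score with the combined values. This is only an illustration under the assumption of the classic Lucene idf formula; `ShardTermStats` and `merge_stats` are hypothetical names and the numbers are made up.

```python
import math
from typing import List, NamedTuple

class ShardTermStats(NamedTuple):
    doc_freq: int  # documents on the shard containing the term
    max_docs: int  # total documents on the shard

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed formula: classic Lucene TF-IDF, 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def merge_stats(per_shard: List[ShardTermStats]) -> ShardTermStats:
    # Extra "DFS" step: sum the statistics reported by every shard.
    return ShardTermStats(
        doc_freq=sum(s.doc_freq for s in per_shard),
        max_docs=sum(s.max_docs for s in per_shard),
    )

# Hypothetical skewed index, e.g. as a result of custom routing.
per_shard = [ShardTermStats(10, 1000), ShardTermStats(200, 1000)]

local_idfs = [classic_idf(s.doc_freq, s.max_docs) for s in per_shard]
merged = merge_stats(per_shard)
global_idf = classic_idf(merged.doc_freq, merged.max_docs)

print("per-shard idf:", local_idfs)
print("index-wide idf:", global_idf)
```

With the merged statistics every shard scores the term with the same idf, which is what makes scores comparable across shards.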

Hope that helps


(Rongduan Zhu) #3

Hi Colin,

The answer is really good! Really appreciate it!

I tried using dfs_query_then_fetch as the search type and, as you and the docs said, the term statistics are now per index.

However, I've noticed that the document count used for the calculations is still per shard. So maxDocs (which I assume is the document count) is the number of documents in the shard rather than in the index. Although maxDocs only differs by 1 between shards, the score is still affected (as my index doesn't have lots of documents).

For example, two documents that have the same content in one field get different scores when I search on that field, because maxDocs is off by 1 between their shards.

So I was wondering whether there is something I can specify so that the document count is also the index's document count.

Thanks in advance.
Kevin


(Adrien Grand) #4

What makes you think that only the local maxDoc is taken into account for scoring? I tried to build a quick recreation using Sense and things seem to work just fine:

DELETE test

PUT test/test/1
{
  "foo": "bar"
}

PUT test/test/2
{
  "foo": "bar"
}

PUT test/test/3
{
  
}

GET test/_search?search_type=dfs_query_then_fetch
{
  "explain": true, 
  "query": {
    "term": {
      "foo": {
        "value": "bar"
      }
    }
  }
}

returns

{
   "took": 11,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 1,
      "hits": [
         {
            "_shard": 2,
            "_node": "ws4GOXSzRPy0Jg6-Yb0C6g",
            "_index": "test",
            "_type": "test",
            "_id": "2",
            "_score": 1,
            "_source": {
               "foo": "bar"
            },
            "_explanation": {
               "value": 1,
               "description": "weight(foo:bar in 0) [PerFieldSimilarity], result of:",
               "details": [
                  {
                     "value": 1,
                     "description": "fieldWeight in 0, product of:",
                     "details": [
                        {
                           "value": 1,
                           "description": "tf(freq=1.0), with freq of:",
                           "details": [
                              {
                                 "value": 1,
                                 "description": "termFreq=1.0",
                                 "details": []
                              }
                           ]
                        },
                        {
                           "value": 1,
                           "description": "idf(docFreq=2, maxDocs=3)",
                           "details": []
                        },
                        {
                           "value": 1,
                           "description": "fieldNorm(doc=0)",
                           "details": []
                        }
                     ]
                  }
               ]
            }
         },
         {
            "_shard": 3,
            "_node": "ws4GOXSzRPy0Jg6-Yb0C6g",
            "_index": "test",
            "_type": "test",
            "_id": "1",
            "_score": 1,
            "_source": {
               "foo": "bar"
            },
            "_explanation": {
               "value": 1,
               "description": "weight(foo:bar in 0) [PerFieldSimilarity], result of:",
               "details": [
                  {
                     "value": 1,
                     "description": "fieldWeight in 0, product of:",
                     "details": [
                        {
                           "value": 1,
                           "description": "tf(freq=1.0), with freq of:",
                           "details": [
                              {
                                 "value": 1,
                                 "description": "termFreq=1.0",
                                 "details": []
                              }
                           ]
                        },
                        {
                           "value": 1,
                           "description": "idf(docFreq=2, maxDocs=3)",
                           "details": []
                        },
                        {
                           "value": 1,
                           "description": "fieldNorm(doc=0)",
                           "details": []
                        }
                     ]
                  }
               ]
            }
         }
      ]
   }
}

The interesting part is that the idf calculation reports maxDoc=3 even though my documents are on different shards.
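As a quick sanity check on those numbers, the reported idf of 1 is what the classic Lucene formula, 1 + ln(maxDocs / (docFreq + 1)), gives for docFreq=2 and maxDocs=3. The formula is an assumption on my part that happens to match the explain output; the shard-local values in the sketch are hypothetical (a shard holding only the matching document).

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Assumed formula: classic Lucene TF-IDF, 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Index-wide statistics, as reported in the explain output.
global_idf = classic_idf(doc_freq=2, max_docs=3)  # 1 + ln(3/3) = 1.0

# Hypothetical shard-local statistics: one document, which contains the term.
local_idf = classic_idf(doc_freq=1, max_docs=1)   # 1 + ln(1/2), about 0.307

print(global_idf, local_idf)
```

If only shard-local statistics were used, the explanation would show something other than docFreq=2, maxDocs=3 and the idf would not come out as 1.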

The only limitation I'm aware of is that if you have more than 2B documents in your index, Elasticsearch will use maxDoc=2B, since Lucene stores maxDoc in a 32-bit integer.

