Why does IDF differ between hits for the same query?

Ulrik_Pedersen · June 24, 2016, 8:21am

Hi

I'm trying to understand how scoring affects my search results, but I seem to misunderstand the IDF term.

When querying a single index with a single document type, the values used to calculate the IDF are not the same for all results.

As I understand it, they should be the same, since the IDF is calculated from how many of the overall documents contain the search term.

"description": "weight(file_content:manual in 0) [PerFieldSimilarity], result of:",
"details": [
  {
    "value": 0.055544302,
    "description": "score(doc=0,freq=1.0), product of:",
    "details": [
      {
        "value": 0.5771883,
        "description": "queryWeight, product of:",
        "details": [
          {
            "value": 3.0794415,
            "description": "idf(docFreq=1, maxDocs=16)",
            "details": []
          },
          {
            "value": 0.18743278,
            "description": "queryNorm",
            "details": []
          }
        ]
      },

In the next hit, docFreq and maxDocs are, for example, 1 and 8, even though it is the same search term and the same index.

How is that possible?
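
For reference, the idf value in the explain output above can be reproduced by hand. The sketch below assumes the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)). The formula is not quoted in this thread, but it matches the 3.0794415 shown for docFreq=1, maxDocs=16. The function name `classic_idf` is only illustrative.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Assumed classic TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# First hit in the explain output: idf(docFreq=1, maxDocs=16)
idf_first_hit = classic_idf(1, 16)

# Next hit as described in the post: docFreq=1, maxDocs=8
idf_next_hit = classic_idf(1, 8)

# The explain output lists queryWeight as the product of idf and queryNorm
query_norm = 0.18743278
query_weight = idf_first_hit * query_norm

print(f"idf, first hit: {idf_first_hit:.7f}")
print(f"idf, next hit:  {idf_next_hit:.7f}")
print(f"queryWeight:    {query_weight:.7f}")
```

The same term therefore gets a noticeably lower idf when maxDocs is 8 than when it is 16, which is enough to reorder hits in a small index.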

mvg · June 24, 2016, 9:18am

Maybe that hit comes from a different shard? Have you tried setting the search_type parameter to dfs_query_then_fetch? This computes the distributed docFreq first, so that all shards work with the same docFreq when computing the score.
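
A minimal sketch of what such a request could look like, built in Python without actually sending it. The host `http://localhost:9200` and the index name `my_index` are placeholders, not taken from this thread; the field `file_content` and the term `manual` come from the explain output above. `search_type` is passed as a URL parameter and `explain` is set in the request body.

```python
import json
from urllib.parse import urlencode


def build_dfs_search(host: str, index: str, field: str, term: str):
    """Return (url, body) for an explained search that uses global term statistics."""
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    url = f"{host}/{index}/_search?{params}"
    body = json.dumps({
        "explain": True,  # include the score breakdown for every hit
        "query": {"match": {field: term}},
    })
    return url, body


# Placeholder host and index name; adjust to your own cluster.
url, body = build_dfs_search("http://localhost:9200", "my_index", "file_content", "manual")
print(url)
print(body)
```

Sending `body` to `url` with any HTTP client should then return hits whose explain output shows the same docFreq and maxDocs for the term, regardless of which shard a hit came from.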

Ulrik_Pedersen · June 24, 2016, 9:35am

OK - here comes the answer...

The maxDocs only refers to documents in the same shard as the one the hit came from.

This has a significant impact on the final score in my test case, where I only have a few documents indexed in my cluster and am doing a multi-match with two search terms.

The most relevant document, where both search terms appear in the searched field, is only presented as hit number 3, even though all logic says it should be number 1. Hits number 1 and 2 only include the first search term.

This happens because the documents are indexed on 3 different shards, one of the search terms only appears in 1 document on each shard, and at the same time the number of documents per shard varies from 9 to 31.

This of course gives a difference in the overall scoring.
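
To make the effect concrete, here is a rough sketch comparing per-shard idf values with an idf computed from cluster-wide statistics. It assumes the classic TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)), which is consistent with the explain output earlier in the thread. The shard sizes 9 and 31 come from the description above; the middle value of 20 is invented for illustration.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Assumed classic TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# (docFreq, maxDocs) per shard: the term appears in 1 document on each shard.
# 9 and 31 are from the post; 20 is a hypothetical middle shard size.
shards = [(1, 9), (1, 20), (1, 31)]

# Default behaviour: each shard scores with its own local statistics.
per_shard_idf = [classic_idf(df, n) for df, n in shards]

# Cluster-wide statistics: sum docFreq and maxDocs over all shards first.
global_doc_freq = sum(df for df, _ in shards)
global_max_docs = sum(n for _, n in shards)
global_idf = classic_idf(global_doc_freq, global_max_docs)

for (df, n), idf in zip(shards, per_shard_idf):
    print(f"shard with {n:2d} docs: idf = {idf:.4f}")
print(f"global ({global_max_docs} docs, docFreq={global_doc_freq}): idf = {global_idf:.4f}")
```

With local statistics the same term is weighted differently on each shard, so a document on the largest shard can outrank a better match on the smallest one. With summed statistics every hit uses one shared idf.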

Ideally, shouldn't maxDocs be calculated at the cluster level?

Ulrik_Pedersen · June 24, 2016, 9:36am

Ah, sorry, your post came in before mine.

Great - I'll try that.

UPDATE:

This blog post, https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch, gives a good explanation of why this situation happens. All in all, it concludes that this most probably won't occur if the cluster contains enough data.

In development phases, some might not have more than 100 documents in their cluster, so the lesson must be to get closer to the number of documents expected in the production environment (or at least "many" documents on each shard) before trying to tweak the search results.
