Help interpreting explain results, IDF behavior (newbie)

redec · April 15, 2020, 11:26pm

I have two documents that match my filter, and both have an identical value in the field being queried, yet they get very different scores.

Here is the returned result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.72951484,
    "hits" : [
      {
        "_shard" : "[taskassignment][2]",
        "_node" : "yCeD_OyyQqqbBRoMgqP_ng",
        "_index" : "taskassignment",
        "_type" : "_doc",
        "_id" : "0536f1edb103480f9d7917fdb29a2f09",
        "_score" : 0.72951484,
        "_source" : {
          "tenantSlug" : "0536f1edb103480f9d7917fdb29a2f09",
          "project" : {
            "name" : "asd"
          }
        },
        "_explanation" : {
          "value" : 0.72951484,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.72951484,
              "description" : "weight(project.name.ngram:a in 11) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.72951484,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.72951484,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 13,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 27,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.0,
              "description" : "match on required clause, product of:",
              "details" : [
                {
                  "value" : 0.0,
                  "description" : "# clause",
                  "details" : [ ]
                },
                {
                  "value" : 1.0,
                  "description" : "tenantSlug:0536f1edb103480f9d7917fdb29a2f09",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[taskassignment][3]",
        "_node" : "FmUxDSnbT8qvwSkPtC3Agg",
        "_index" : "taskassignment",
        "_type" : "_doc",
        "_id" : "9536f1edb102480f9d7117fdb29a2faa",
        "_score" : 0.3276874,
        "_source" : {
          "tenantSlug" : "0536f1edb103480f9d7917fdb29a2f09",
          "project" : {
            "name" : "asd"
          },
          "task" : {
            "name" : "vbnt"
          }
        },
        "_explanation" : {
          "value" : 0.3276874,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.3276874,
              "description" : "weight(project.name.ngram:a in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.3276874,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.3276874,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 24,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 33,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.0,
              "description" : "match on required clause, product of:",
              "details" : [
                {
                  "value" : 0.0,
                  "description" : "# clause",
                  "details" : [ ]
                },
                {
                  "value" : 1.0,
                  "description" : "tenantSlug:0536f1edb103480f9d7917fdb29a2f09",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

This is the query I ran:

GET /taskassignment/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": { 
        "match": { "project.name.ngram": "a" }
      },
      "filter": {
        "term": { "tenantSlug": "0536f1edb103480f9d7917fdb29a2f09"}
      }
    }
  }
}

These are my mappings/settings:

{
  "taskassignment" : {
    "mappings" : {
      "properties" : {
        "project" : {
          "properties" : {
            "name" : {
              "type" : "text",
              "fields" : {
                "ngram" : {
                  "type" : "text",
                  "analyzer" : "ngram"
                }
              }
            }
          }
        },
        "tenantSlug" : {
          "type" : "keyword",
          "ignore_above" : 256
        }
      }
    },
    "settings" : {
      "index" : {
        "analysis" : {
          "analyzer" : {
            "ngram" : {
              "filter" : [ "lowercase" ],
              "tokenizer" : "ngram"
            }
          },
          "tokenizer" : {
            "ngram" : {
              "token_chars" : [
                "letter",
                "digit"
              ],
              "min_gram" : "1",
              "type" : "ngram",
              "max_gram" : "2"
            }
          }
        }
      }
    }
  }
}

From what I can tell, it's using different document counts in the idf calculation for different hits within the same query. How is this possible? Do I understand correctly that n counts the number of documents that have the letter 'a' in the project.name.ngram field? Is N supposed to count all the documents that match my filter, or all the documents in the index? Neither seems accurate. Or is it all the documents in the shard? (Plausible.)

Also, is it possible to disable the idf calculation? In my use case I think it will cause more problems than it's worth.
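
For reference, both scores can be reproduced by hand from the values in the explain output above. This is a worked sketch that assumes the logarithm is the natural log, which is consistent with the numbers shown:

```latex
\begin{align*}
\mathrm{tf} &= \frac{1}{1 + 1.2\,\bigl(1 - 0.75 + 0.75 \cdot \tfrac{5}{5}\bigr)}
             = \frac{1}{2.2} \approx 0.4545 \\
\mathrm{idf}_{1} &= \ln\!\Bigl(1 + \frac{27 - 13 + 0.5}{13 + 0.5}\Bigr)
                  = \ln\!\Bigl(1 + \frac{14.5}{13.5}\Bigr) \approx 0.7295 \\
\mathrm{idf}_{2} &= \ln\!\Bigl(1 + \frac{33 - 24 + 0.5}{24 + 0.5}\Bigr)
                  = \ln\!\Bigl(1 + \frac{9.5}{24.5}\Bigr) \approx 0.3277 \\
\mathrm{score}_{i} &= \mathrm{boost} \cdot \mathrm{idf}_{i} \cdot \mathrm{tf}
                    = 2.2 \cdot \mathrm{idf}_{i} \cdot \frac{1}{2.2} = \mathrm{idf}_{i}
\end{align*}
```

The boost and tf factors are identical for both hits and cancel to 1, so the entire difference in _score comes from the n and N values used for idf.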

Christian_Dahlqvist · April 16, 2020, 8:01am

The statistics used for the calculation are stored and, by default, evaluated per shard. As you seem to have more than one shard, your two matches can end up in different shards with different background statistics (the _shard values in your output show shards 2 and 3). I believe you can use dfs_query_then_fetch to reduce or eliminate this impact.
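
A minimal sketch of the query from the question with that search type set through the search_type URL parameter. The body is unchanged; only the parameter is added. The extra DFS phase collects term statistics from all shards before scoring, which costs an additional round trip, so whether it is worth it depends on the workload:

```console
GET /taskassignment/_search?search_type=dfs_query_then_fetch
{
  "explain": true,
  "query": {
    "bool": {
      "must": {
        "match": { "project.name.ngram": "a" }
      },
      "filter": {
        "term": { "tenantSlug": "0536f1edb103480f9d7917fdb29a2f09" }
      }
    }
  }
}
```

With this search type, the n and N values in the explain output should be the same for every hit, because they are computed across all shards rather than per shard.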

redec · April 16, 2020, 5:11pm

Ahh, OK, that makes sense. Thanks! Do you know if there is a way to disable the IDF calculation? If I'm understanding this correctly, the default similarity consists of three basic components. Term frequency and field length normalization can be disabled individually (with "index_options": "docs" and "norms": false respectively), but I can't find a way to disable the IDF calculation.
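
The two settings mentioned above could be applied to the ngram sub-field as in the sketch below. It is not something confirmed in this thread: the index name taskassignment_v2 is hypothetical, and the "similarity": "boolean" line is an added assumption about what might address the IDF part, since the built-in boolean similarity scores a matching term by its query boost alone rather than with BM25. These mapping parameters generally cannot be changed on an existing field, so a new index and a reindex would likely be needed:

```console
PUT /taskassignment_v2
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "ngram": {
            "filter": [ "lowercase" ],
            "tokenizer": "ngram"
          }
        },
        "tokenizer": {
          "ngram": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 2,
            "token_chars": [ "letter", "digit" ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "project": {
        "properties": {
          "name": {
            "type": "text",
            "fields": {
              "ngram": {
                "type": "text",
                "analyzer": "ngram",
                "index_options": "docs",
                "norms": false,
                "similarity": "boolean"
              }
            }
          }
        }
      }
    }
  }
}
```

Note that "index_options": "docs" also drops positions from the index, so phrase queries on this sub-field would no longer work.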
