Help interpreting explain results, IDF behavior (newbie)

I have two documents that match my filter, and both have an identical value in the field being queried, yet they receive vastly different scores.

Here is the returned result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 6,
    "successful" : 6,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.72951484,
    "hits" : [
      {
        "_shard" : "[taskassignment][2]",
        "_node" : "yCeD_OyyQqqbBRoMgqP_ng",
        "_index" : "taskassignment",
        "_type" : "_doc",
        "_id" : "0536f1edb103480f9d7917fdb29a2f09",
        "_score" : 0.72951484,
        "_source" : {
          "tenantSlug" : "0536f1edb103480f9d7917fdb29a2f09",
          "project" : {
            "name" : "asd"
          }
        },
        "_explanation" : {
          "value" : 0.72951484,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.72951484,
              "description" : "weight(project.name.ngram:a in 11) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.72951484,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.72951484,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 13,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 27,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.0,
              "description" : "match on required clause, product of:",
              "details" : [
                {
                  "value" : 0.0,
                  "description" : "# clause",
                  "details" : [ ]
                },
                {
                  "value" : 1.0,
                  "description" : "tenantSlug:0536f1edb103480f9d7917fdb29a2f09",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[taskassignment][3]",
        "_node" : "FmUxDSnbT8qvwSkPtC3Agg",
        "_index" : "taskassignment",
        "_type" : "_doc",
        "_id" : "9536f1edb102480f9d7117fdb29a2faa",
        "_score" : 0.3276874,
        "_source" : {
          "tenantSlug" : "0536f1edb103480f9d7917fdb29a2f09",
          "project" : {
            "name" : "asd"
          },
          "task" : {
            "name" : "vbnt"
          }
        },
        "_explanation" : {
          "value" : 0.3276874,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.3276874,
              "description" : "weight(project.name.ngram:a in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.3276874,
                  "description" : "score(freq=1.0), product of:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.3276874,
                      "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 24,
                          "description" : "n, number of documents containing term",
                          "details" : [ ]
                        },
                        {
                          "value" : 33,
                          "description" : "N, total number of documents with field",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.45454544,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "freq, occurrences of term within document",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "k1, term saturation parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "b, length normalization parameter",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "dl, length of field",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.0,
                          "description" : "avgdl, average length of field",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.0,
              "description" : "match on required clause, product of:",
              "details" : [
                {
                  "value" : 0.0,
                  "description" : "# clause",
                  "details" : [ ]
                },
                {
                  "value" : 1.0,
                  "description" : "tenantSlug:0536f1edb103480f9d7917fdb29a2f09",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

This is the query I ran:

GET /taskassignment/_search
{
  "explain": true,
  "query": {
    "bool": {
      "must": { 
        "match": { "project.name.ngram": "a" }
      },
      "filter": {
        "term": { "tenantSlug": "0536f1edb103480f9d7917fdb29a2f09"}
      }
    }
  }
}

This is my mappings/settings:

{
  "taskassignment" : {
    "mappings" : {
      "properties" : {
        "project" : {
          "properties" : {
            "name" : {
              "type" : "text",
              "fields" : {
                "ngram" : {
                  "type" : "text",
                  "analyzer" : "ngram"
                }
              }
            }
          }
        },
        "tenantSlug" : {
          "type" : "keyword",
          "ignore_above" : 256
        }
      }
    },
    "settings" : {
      "index" : {
        "analysis" : {
          "analyzer" : {
            "ngram" : {
              "filter" : [ "lowercase" ],
              "tokenizer" : "ngram"
            }
          },
          "tokenizer" : {
            "ngram" : {
              "token_chars" : [
                "letter",
                "digit"
              ],
              "min_gram" : "1",
              "type" : "ngram",
              "max_gram" : "2"
            }
          }
        }
      }
    }
  }
}

From what I can tell, the idf calculation is using different document counts for different hits within the same query. How is this possible? Do I understand correctly that n is the number of documents that have the letter 'a' in the project.name field? Is that count supposed to cover all the documents that match my filter, or all the documents in the index? Neither seems accurate. Or is it all the documents in the shard (plausible)? Also, is it possible to disable the idf calculation? In my use case I think it will cause more problems than it's worth.
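As a sanity check on the explain output, the two scores can be recomputed by hand from the idf and tf formulas printed in the `description` fields. The sketch below plugs in the values reported for each hit; the function names are made up for illustration and are not part of any Elasticsearch API.

```python
import math


def bm25_idf(n, N):
    # idf formula as printed in the explain output:
    # log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, k1, b, dl, avgdl):
    # tf formula as printed in the explain output:
    # freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(boost, n, N, freq=1.0, k1=1.2, b=0.75, dl=5.0, avgdl=5.0):
    # The explain output describes the score as the product boost * idf * tf.
    # The defaults are the freq, k1, b, dl and avgdl values shown for both hits.
    return boost * bm25_idf(n, N) * bm25_tf(freq, k1, b, dl, avgdl)


# First hit (shard [taskassignment][2]): n = 13, N = 27
hit_shard_2 = bm25_score(boost=2.2, n=13, N=27)

# Second hit (shard [taskassignment][3]): n = 24, N = 33
hit_shard_3 = bm25_score(boost=2.2, n=24, N=33)

print(hit_shard_2, hit_shard_3)
```

Both hits share the same boost (2.2) and tf (1 / 2.2), which cancel out, so each score ends up equal to its idf. The whole gap between 0.72951484 and 0.3276874 therefore comes from the different n and N values.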

The statistics used for the calculation are stored and, by default, evaluated per shard. As you seem to have more than one shard, it is possible that your two matches end up in different shards with different background statistics. I believe you can use dfs_query_then_fetch to reduce or eliminate this impact.
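A minimal sketch of what that might look like, assuming the search type is passed as the `search_type` query parameter on the `_search` endpoint. The host `http://localhost:9200` and the helper name `build_dfs_search` are assumptions for illustration only; the request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: replace with your cluster address


def build_dfs_search(index, body):
    # Hypothetical helper: builds a _search request that asks Elasticsearch to
    # gather term statistics from all shards before scoring.
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    return Request(
        url=f"{ES_URL}/{index}/_search?{params}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same query as in the original post, with explain left on so the
# n and N values can be compared across the two hits.
query = {
    "explain": True,
    "query": {
        "bool": {
            "must": {"match": {"project.name.ngram": "a"}},
            "filter": {"term": {"tenantSlug": "0536f1edb103480f9d7917fdb29a2f09"}},
        }
    },
}

request = build_dfs_search("taskassignment", query)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

If the per-shard statistics are the cause, both hits should then report the same n and N in their explanations and receive the same score.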

Ahh, OK, that makes sense. Thanks! Do you know whether there is a way to disable the IDF calculation? If I'm understanding this correctly, the default similarity consists of three basic components. Term frequency and field-length normalization can each be disabled individually (with "index_options: docs" and "norms: false", respectively), but I can't find a way to disable the IDF calculation.
