Question regarding the IDF value for the optional fields in a document

kaykanloo · November 30, 2021, 8:14pm

Hi,

I'm relatively new to Elasticsearch, so I'm probably missing something trivial here, but I'm having issues with the relevancy score of the search results when it comes to optional fields in documents. Consider the following example:

Test data:

DELETE /my-index

PUT /my-index

POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}

Search Query:
If I run a search query similar to the one below:

GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}

Expectation
The end-user would expect Documents #9 and #10 to score higher than the others, because their optional_field contains both words of the search query.

Reality
Document #1 scores higher than #10, even though it contains only one of the two words of the search query, which is the opposite of what end-users would most likely expect.

A closer look at _explain
Here are the _explain results of running the same search query for Document #1:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.4816045,
    "description" : "max of:",
    "details" : [
      {
        "value" : 1.4816045,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 1.4816045,
            "description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 1.4816045,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.4816046,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 10,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

And here are the _explain results of running the same search query for Document #10:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "10",
  "matched" : true,
  "explanation" : {
    "value" : 0.36464313,
    "description" : "max of:",
    "details" : [
      {
        "value" : 0.36464313,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

As you can see, Document #10 scores worse, mainly because of its lower IDF value (0.18232156, versus 1.4816046 for Document #1). Looking closely, this is because the IDF formula uses N, the "total number of documents with field", which is 2 for optional_field, instead of the total number of documents in the index, which is 10.
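
To double-check my reading of the _explain output, here is a minimal Python sketch that recomputes the scores from the formulas and values printed above. The function names are my own, and I am assuming the log in the IDF formula is the natural logarithm, which is what matches the reported numbers. The last case is purely hypothetical: it shows what Document #10 would score if N were the total number of documents in the index (10), with dl and avgdl left unchanged.

```python
import math

BOOST = 2.2  # "boost" value reported by _explain


def bm25_idf(n, N):
    """IDF as printed by _explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """TF as printed by _explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(n, N, freq, dl, avgdl):
    """Per-term score, computed as boost * idf * tf."""
    return BOOST * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Document #1: one matching term in required_field (n=2, N=10).
doc1 = term_score(n=2, N=10, freq=1, dl=1, avgdl=1)

# Document #10 as actually scored: two matching terms in optional_field (n=2, N=2).
doc10_actual = 2 * term_score(n=2, N=2, freq=1, dl=2, avgdl=2)

# Hypothetical (my assumption): same matches, but N counts every document in the index.
doc10_hypothetical = 2 * term_score(n=2, N=10, freq=1, dl=2, avgdl=2)

print(f"doc1={doc1:.4f}")
print(f"doc10_actual={doc10_actual:.4f}")
print(f"doc10_hypothetical={doc10_hypothetical:.4f}")
```

The first two values reproduce the _explain scores (about 1.4816 and 0.3646). In the hypothetical case Document #10 would come out ahead of Document #1, which is the ranking I was expecting.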

Question
Is there any way to force the multi_match query to consider all documents (instead of only those that contain the field) when computing the IDF value for an optional field, so that the relevance score is closer to what end-users expect?
Alternatively, is there a better way to write the search query so that I get the expected results?

Any help would be greatly appreciated. Thanks.

Regards,
Kaykanloo

kaykanloo · December 3, 2021, 3:45pm

Bumping this post.

Any suggestion would be immensely helpful. Thanks.

Regards,
Kaykanloo

kaykanloo · December 6, 2021, 7:54pm

Bumping this post again.

Can anyone please confirm whether this is indeed a missing feature, or whether I'm simply missing something here? I would like to open a feature request ticket on GitHub if what I'm trying to achieve is not currently supported.

Any feedback would be greatly appreciated. Thanks.

kaykanloo · December 22, 2021, 8:18pm

Still haven't found any solutions to this. Any feedback would be greatly appreciated.

system · January 19, 2022, 8:19pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
