Question regarding the IDF value for the optional fields in a document

Hi,

I'm relatively new to Elasticsearch, so I'm probably missing something trivial here, but I'm having issues with the relevancy score of the search results when it comes to optional fields in documents. Consider the following example:

Test data:

DELETE /my-index

PUT /my-index

POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}

Search Query:
If I run a search query similar to the one below:

GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}

Expectation
The end-user would expect Documents #9 and #10 to score higher than the others, because they contain both words of the search query in their optional_field.

Reality
Document #1 scores higher than Document #10, even though it contains only one of the two words of the search query, which is the opposite of what end-users would most likely expect.

A closer look at _explain
Here are the _explain results of running the same search query for Document #1:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.4816045,
    "description" : "max of:",
    "details" : [
      {
        "value" : 1.4816045,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 1.4816045,
            "description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 1.4816045,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.4816046,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 10,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

And here are the _explain results of running the same search query for Document #10:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "10",
  "matched" : true,
  "explanation" : {
    "value" : 0.36464313,
    "description" : "max of:",
    "details" : [
      {
        "value" : 0.36464313,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

As you can see, Document #10 scores worse, mainly because of its lower IDF value (0.18232156, versus 1.4816046 for Document #1). Looking closely, this happens because the IDF formula uses N, the total number of documents with the field (2 for optional_field), rather than the total number of documents in the index (10).
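To double-check this reading of the _explain output, the numbers can be reproduced by hand. The following is a minimal Python sketch that only re-implements the two formulas printed in the explanations above; the function names (bm25_idf, bm25_tf) are my own and not part of any Elasticsearch API, and the boost of 2.2 is simply copied from the output. The last value is purely hypothetical: it shows what Document #10 would score if N were 10 for optional_field, which is not something I know how to configure.

```python
import math

def bm25_idf(n, N):
    """idf as printed by _explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf as printed by _explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

BOOST = 2.2  # copied from the "boost" entries in the _explain output

# Document #1: one term matched in required_field (n=2, N=10, dl=avgdl=1).
doc1_score = BOOST * bm25_idf(2, 10) * bm25_tf(1.0, 1.0, 1.0)

# Document #10: two terms matched in optional_field (n=2, N=2, dl=avgdl=2).
per_term_actual = BOOST * bm25_idf(2, 2) * bm25_tf(1.0, 2.0, 2.0)
doc10_score = 2 * per_term_actual

# Hypothetical: same two terms, but with N taken as the index size (10).
per_term_hypothetical = BOOST * bm25_idf(2, 10) * bm25_tf(1.0, 2.0, 2.0)
doc10_hypothetical = 2 * per_term_hypothetical

print(f"doc 1 (actual):        {doc1_score:.7f}")
print(f"doc 10 (actual):       {doc10_score:.7f}")
print(f"doc 10 (N=10, hypoth): {doc10_hypothetical:.7f}")
```

The first two values agree with the _explain output up to floating-point rounding (Lucene works in single precision). With N = 10, Document #10 would score roughly twice as high as Document #1, which is the ordering end-users expect.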

Question
Is there any way to force the multi_match query to consider all the documents in the index (instead of only those that contain the field) when computing the IDF value for an optional field, so that the relevance score is closer to what end-users expect?
Alternatively, is there a better way to write the search query so that I get the expected results?

Any help would be greatly appreciated. Thanks.

Regards,
Kaykanloo

Bumping this post.

Any suggestion would be immensely helpful. Thanks.

Regards,
Kaykanloo

Bumping this post again.

Can anyone please confirm whether this is indeed a missing feature, or whether I'm simply missing something here? I would like to open a feature request ticket on GitHub if what I'm trying to achieve is not currently supported.

Any feedback would be greatly appreciated. Thanks.

Still haven't found any solutions to this. Any feedback would be greatly appreciated.
