Synonym_graph + match_phrase: Unexpected high score due to the sum up of IDF of all matched synonym words

Patrick_N · September 3, 2021, 9:28am

(ES 7.10) I use a synonym_graph filter (rule: usa, united states north america) in the search_analyzer of a field.

When I run a match_phrase query that contains usa, the score of any matching doc that contains the term usa is substantially higher than the rest.

GET test/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {"name": {"query": "usa"}}
        },
        {
          "match_phrase": {"name": {"query": "happy boy"}}
        }
      ]
    }
  }
}

On closer inspection, this happens because the IDF used for that clause is the sum of the IDFs of all the terms in the synonym rule (usa, united, states, north, america), instead of just the IDF of the term that actually matched (usa).

Explanation (I just extracted only the part for the 1st hit):

      {
        "_shard" : "[test][0]",
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.9519825,
        "_source" : {
          "name" : "usa happy boy"
        },
        "_explanation" : {
          "value" : 6.9519825,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 6.215455,
              "description" : "weight(spanOr([spanNear([name:united, name:states, name:north, name:america], 0, true), name:usa]) in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 6.215455,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost"
                    },
                    {
                      "value" : 6.019864,
                      "description" : "idf, sum of:",
                      "details" : [
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        }
                      ]
                    },
                    {
                      "value" : 0.46931404,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      (details of tf removed)
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.73652726,
              "description" : """weight(name:"happy boy" in 0) [PerFieldSimilarity], result of:""",
              "details" : [
                {
                  "value" : 0.73652726,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost"
                    },
                    {
                      "value" : 0.7133499,
                      "description" : "idf, sum of:",
                      "details" : [
                        {
                          "value" : 0.35667494,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 3,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field"
                            }
                          ]
                        },
                        {
                          "value" : 0.35667494,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 3,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field"
                            }
                          ]
                        }
                      ]
                    },
                    {
                      "value" : 0.46931404,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      (details of tf removed)
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
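
The numbers in the explanation above can be checked by hand. Below is a minimal Python sketch that recomputes both clause scores from the values printed by explain. The helper name bm25_idf is mine, and the tf value is copied as-is because its details were removed from the output.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """IDF exactly as printed by explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


BOOST = 2.2      # "boost" from the explain output
TF = 0.46931404  # "tf" from the explain output (details elided in the post)
N = 4            # total number of documents with the field

# Synonym clause: five terms (usa, united, states, north, america), each with n = 1.
synonym_idf = sum(bm25_idf(1, N) for _ in range(5))
# Phrase clause: two terms (happy, boy), each with n = 3.
phrase_idf = sum(bm25_idf(3, N) for _ in range(2))

synonym_score = BOOST * synonym_idf * TF
phrase_score = BOOST * phrase_idf * TF
total_score = synonym_score + phrase_score

print(f"synonym clause: idf={synonym_idf:.6f} score={synonym_score:.6f}")
print(f"phrase clause:  idf={phrase_idf:.6f} score={phrase_score:.6f}")
print(f"total:          {total_score:.6f}")
```

Up to float rounding, the results agree with the 6.019864 / 6.215455 and 0.7133499 / 0.73652726 values shown by explain, which confirms that the synonym clause is weighted by five IDFs even though only usa occurs in the document.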

This is causing a problem because my program searches multiple fields, and this unexpectedly high IDF for any doc that matches a synonym distorts the ranking.

E.g. if I run the search below, I don't run into the high IDF problem for the 1st matched doc:

GET test/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {"name": {"query": "germany"}}
        },
        {
          "match_phrase": {"name": {"query": "happy boy"}}
        }
      ]
    }
  }
}

Since only the term usa from the search matches, can we count only the IDF of that term? Is that possible?
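
To put a number on what I am asking for, here is a small Python sketch comparing the observed score of the usa clause with the score it would get if only the IDF of usa were counted. This is plain arithmetic on the explain values, not an Elasticsearch feature; the variable names are mine and "wanted" is a hypothetical value.

```python
import math

N, n_usa = 4, 1                # docs with the field, docs containing "usa"
BOOST, TF = 2.2, 0.46931404    # copied from the explain output

idf_usa = math.log(1 + (N - n_usa + 0.5) / (n_usa + 0.5))

observed = BOOST * (5 * idf_usa) * TF  # IDF summed over all five synonym terms
wanted = BOOST * idf_usa * TF          # hypothetical: IDF of "usa" only

print(f"observed={observed:.4f} wanted={wanted:.4f} ratio={observed / wanted:.1f}")
```

In this index all five synonym terms have the same document frequency, so the clause ends up weighted five times more than a single-term match with the same n and tf would be.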

Setup:

DELETE test

PUT test
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "usa, united states north america"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}


#### Add test docs
PUT test/_doc/1
{ "name" : "usa happy boy" }

PUT test/_doc/2
{ "name" : "orange happy boy" }

PUT test/_doc/3
{ "name" : "germany happy boy" }

PUT test/_doc/4
{ "name" : "united states north america" }
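
For reference, the statistics in the explain output can be derived from these four docs. The Python sketch below mimics the whitespace analyzer with str.split() and assumes the default BM25 parameters (k1 = 1.2, b = 0.75), since the index does not configure a similarity. The dl and avgdl values are computed here, not taken from the elided tf details.

```python
from collections import Counter

docs = {
    "1": "usa happy boy",
    "2": "orange happy boy",
    "3": "germany happy boy",
    "4": "united states north america",
}

# Whitespace tokenization, as done by the index-time "whitespace" analyzer.
tokens = {doc_id: text.split() for doc_id, text in docs.items()}

N = len(tokens)  # number of documents with the field
# Document frequency: in how many docs each term appears at least once.
doc_freq = Counter(term for terms in tokens.values() for term in set(terms))
avgdl = sum(len(terms) for terms in tokens.values()) / N

K1, B = 1.2, 0.75  # assumption: BM25 defaults


def bm25_tf(freq: float, dl: int) -> float:
    """tf as printed by explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + K1 * (1 - B + B * dl / avgdl))


tf_doc1 = bm25_tf(1.0, len(tokens["1"]))

print("N =", N, "avgdl =", avgdl)
for term in ("usa", "united", "states", "north", "america", "happy", "boy"):
    print(term, "n =", doc_freq[term])
print("tf for doc 1 =", round(tf_doc1, 6))
```

Under these assumptions every term of the synonym rule has n = 1, happy and boy have n = 3, and the tf for doc 1 comes out very close to the 0.46931404 reported by explain.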

system · October 1, 2021, 9:29am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
