Synonym_graph + match_phrase: Unexpectedly high score because the IDFs of all synonym terms are summed

(ES 7.10) I use a synonym_graph filter (usa, united states north america) in the search_analyzer of a field.

When I run a match_phrase query that contains usa, any matching doc that contains the term usa scores substantially higher than the rest.

GET test/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {"name": {"query": "usa"}}
        },
        {
          "match_phrase": {"name": {"query": "happy boy"}}
        }
      ]
    }
  }
}

On closer inspection, this is because the IDF used for that doc is the sum of the IDFs of all the terms in the synonym rule (usa, united, states, north, america), rather than just the IDF of the term that actually matched (usa).

Explanation (I extracted only the part for the 1st hit):

      {
        "_shard" : "[test][0]",
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.9519825,
        "_source" : {
          "name" : "usa happy boy"
        },
        "_explanation" : {
          "value" : 6.9519825,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 6.215455,
              "description" : "weight(spanOr([spanNear([name:united, name:states, name:north, name:america], 0, true), name:usa]) in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 6.215455,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost"
                    },
                    {
                      "value" : 6.019864,
                      "description" : "idf, sum of:",
                      "details" : [
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        },
                        {
                          "value" : 1.2039728,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field",
                            }
                          ]
                        }
                      ]
                    },
                    {
                      "value" : 0.46931404,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      (details of tf removed)
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.73652726,
              "description" : """weight(name:"happy boy" in 0) [PerFieldSimilarity], result of:""",
              "details" : [
                {
                  "value" : 0.73652726,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                  "details" : [
                    {
                      "value" : 2.2,
                      "description" : "boost"
                    },
                    {
                      "value" : 0.7133499,
                      "description" : "idf, sum of:",
                      "details" : [
                        {
                          "value" : 0.35667494,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 3,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field"
                            }
                          ]
                        },
                        {
                          "value" : 0.35667494,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 3,
                              "description" : "n, number of documents containing term"
                            },
                            {
                              "value" : 4,
                              "description" : "N, total number of documents with field"
                            }
                          ]
                        }
                      ]
                    },
                    {
                      "value" : 0.46931404,
                      "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                      (details of tf removed)
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
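
To double-check the numbers in the explain output, here is a minimal Python sketch of the arithmetic. It only reuses values reported above (boost 2.2, tf 0.46931404, n = 1, N = 4) and the idf formula from the description strings; the function and variable names are my own, and this is the formula applied by hand, not how Lucene computes the score internally. The last value is what the clause would score if only one term's IDF were counted, which is a derived figure, not something Elasticsearch returned.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """IDF exactly as printed in the explain output:
    log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


# Values copied from the explain output for the synonym clause.
BOOST = 2.2
TF = 0.46931404
SYNONYM_TERMS = ["usa", "united", "states", "north", "america"]

# Each of the five terms occurs in 1 of the 4 docs (n = 1, N = 4).
idf_per_term = bm25_idf(n=1, N=4)               # about 1.2039728

# What the explain output reports: the five IDFs added together.
summed_idf = len(SYNONYM_TERMS) * idf_per_term  # about 6.019864
synonym_clause_score = BOOST * summed_idf * TF  # about 6.215455

# Hypothetical: the same clause scored with the IDF of "usa" only.
single_term_score = BOOST * idf_per_term * TF

print(f"idf per term        : {idf_per_term:.7f}")
print(f"summed idf          : {summed_idf:.6f}")
print(f"score (summed idf)  : {synonym_clause_score:.6f}")
print(f"score (single idf)  : {single_term_score:.6f}")
```

So the synonym clause scores five times what a single-term IDF would give, purely because the synonym rule expands to five terms.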

This is a problem because my program searches multiple fields, and this unexpectedly high IDF for any doc that matches a synonym distorts the ranking.

E.g. if I run the search below, I don't run into that high IDF problem for the 1st matched doc:

GET test/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {"name": {"query": "germany"}}
        },
        {
          "match_phrase": {"name": {"query": "happy boy"}}
        }
      ]
    }
  }
}

Since only the term usa from the search matches, can we count only the IDF of that term? Is that possible?

Setup:

DELETE test

PUT test
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "usa, united states north america"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}


#### Add testing doc
PUT test/_doc/1
{ "name" : "usa happy boy" }

PUT test/_doc/2
{ "name" : "orange happy boy" }

PUT test/_doc/3
{ "name" : "germany happy boy" }

PUT test/_doc/4
{ "name" : "united states north america" }
