Synonyms result scoring

Bilal · October 15, 2018, 3:18pm

I'm reading this article about Patterns for Synonyms in Elasticsearch and I have some questions about the results that I got, here is the mappings and settings I used:

PUT wheat_syn
{
  "mappings": {
    "wheat": {
      "properties": {
        "description": {
          "type": "text",
          "analyzer": "syn_text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "autophrase_syn": {
          "type": "synonym",
          "synonyms": ["triticum aestivum => triticum_aestivum",
                       "bread wheat => bread_wheat"]
        },
        "wheat_syn": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": ["triticum_aestivum, bread_wheat, wheat"]
        }
      },
      "analyzer": {
        "syn_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "autophrase_syn", "wheat_syn"]
        }
      }
    }
  }
}

The Documents:

PUT wheat_syn/wheat/_bulk
{ "index" : { "_id" : "1" } }
{ "description": "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food." }
{ "index" : { "_id" : "2" } }
{ "description": "The scientific name is Triticum aestivum." }
{ "index" : { "_id" : "3" } }
{ "description": "bread wheat is good for health." }

The query:

GET wheat_syn/wheat/_search
{
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}

After executing the query, I got the following result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.48155478,
    "hits": [
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "2",
        "_score": 0.48155478,
        "_source": {
          "description": "The scientific name is Triticum aestivum."
        }
      },
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "3",
        "_score": 0.48155478,
        "_source": {
          "description": "bread wheat is good for health."
        }
      },
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "1",
        "_score": 0.46197122,
        "_source": {
          "description": "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food."
        }
      }
    ]
  }
}

Now, my questions are:

I was expecting the sentence "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food." to come first, since it contains the exact word the user searched for. Why isn't that the case?
Why is there a difference in the scores? If it depends on the position of the queried word in the description field, shouldn't the third result be first? (This is a simple example; the more documents I add, the larger the score difference becomes.)

Thank you!

abdon · October 15, 2018, 4:34pm

Great questions!

Synonyms score the same as an exact match. There is no automatic preference for exact matches.
The length of a field is an important factor that determines the score: shorter fields score higher than longer fields. What you're seeing is that the documents with a shorter field (6 terms) score higher than the document with a longer field (18 terms).
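
To make the field-length effect concrete, here is a minimal Python sketch of the BM25 formula as I understand Elasticsearch 6.x applies it by default (k1 = 1.2, b = 0.75). The function name, document counts and field lengths are illustrative assumptions, not values read from the index above. Real scores will differ, because term statistics are computed per shard and field lengths are stored in a lossy encoded form.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """BM25 score of a single term in a single field (hypothetical helper)."""
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: grows with the field length relative to the average.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)


# Assumed toy corpus: three matching documents with 6, 6 and 18 terms.
lengths = [6, 6, 18]
avg_len = sum(lengths) / len(lengths)

short_score = bm25_term_score(tf=1, doc_len=6, avg_doc_len=avg_len, doc_count=3, doc_freq=3)
long_score = bm25_term_score(tf=1, doc_len=18, avg_doc_len=avg_len, doc_count=3, doc_freq=3)

print("short field:", short_score)
print("long field: ", long_score)
```

With the same term frequency and the same IDF, the only thing that differs between the two calls is the length normalization, so the shorter field ends up with the higher score.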

If you want to understand how a score is calculated, you can add "explain": true to a search request, and Elasticsearch will tell you exactly how the score for each hit was calculated:

GET wheat_syn/wheat/_search
{
  "explain": true,
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}

Some suggestions: if you are looking at scores for a small dataset like this, consider creating the index with one shard (instead of the default 5). Otherwise the scoring may be unexpected, as explained in "Relevance Is Broken!".
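
A rough Python sketch of why shards matter for tiny datasets: by default each shard computes IDF from its own documents only, so the same term can get a different weight depending on where a document happens to live. The shard layout and counts below are made up for illustration, and the `idf` helper is a hypothetical stand-in for the BM25 IDF formula.

```python
import math


def idf(doc_count, doc_freq):
    """BM25-style inverse document frequency (hypothetical helper)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Assumed layout: 6 documents over 2 shards, 3 of them contain the term.
shards = {
    "shard_0": {"doc_count": 3, "doc_freq": 1},
    "shard_1": {"doc_count": 3, "doc_freq": 2},
}

# IDF as each shard sees it, using only its local statistics.
local_idf = {name: idf(s["doc_count"], s["doc_freq"]) for name, s in shards.items()}

# IDF if the statistics of all shards were combined.
global_idf = idf(
    sum(s["doc_count"] for s in shards.values()),
    sum(s["doc_freq"] for s in shards.values()),
)

for name, value in local_idf.items():
    print(name, value)
print("global", global_idf)
```

With a single shard the local and global values are the same by construction. As far as I know, the `dfs_query_then_fetch` search type is the other way to get global statistics, at the cost of an extra round trip.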

Also, consider indexing the data twice: once with synonyms and once without synonyms. You can do that by using multi-fields. Your index creation command would become:

PUT wheat_syn
{
  "mappings": {
    "wheat": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "synonyms": {
              "type": "text",
              "analyzer": "syn_text"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autophrase_syn": {
          "type": "synonym",
          "synonyms": [
            "triticum aestivum => triticum_aestivum",
            "bread wheat => bread_wheat"
          ]
        },
        "wheat_syn": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": [
            "triticum_aestivum, bread_wheat, wheat"
          ]
        }
      },
      "analyzer": {
        "syn_text": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autophrase_syn",
            "wheat_syn"
          ]
        }
      }
    }
  }
}

Now, you can use a bool query to search simultaneously with and without synonyms:

GET wheat_syn/wheat/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "description.synonyms": "wheat"
          }
        }
      ],
      "should": [
        {
          "match": {
            "description": {
              "query": "wheat"
            }
          }
        }
      ]
    }
  }
}

Those documents that contain the exact term are the only ones that match the should clause (which does not use synonyms). As a result, those docs will get a higher score and rank at the top of the results.
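
As a sketch of how the bool query combines the two clauses: a document must match the must clause, and the score of every matching should clause is added on top. The per-clause scores below are invented for illustration (they are not output from the index above), and `bool_score` is a hypothetical helper, not an Elasticsearch API.

```python
def bool_score(must_scores, should_scores):
    """Combine clause scores the way a bool query does; None means 'no match'."""
    if any(score is None for score in must_scores):
        return None  # a failed must clause excludes the document
    matched_should = [score for score in should_scores if score is not None]
    return sum(must_scores) + sum(matched_should)


# Assumed clause scores. Documents 1 and 3 contain the literal token "wheat",
# so they also match the should clause on the plain "description" field.
docs = {
    "1": {"must": [0.46], "should": [0.13]},
    "2": {"must": [0.48], "should": [None]},
    "3": {"must": [0.48], "should": [0.18]},
}

totals = {doc_id: bool_score(d["must"], d["should"]) for doc_id, d in docs.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

print(totals)
print(ranking)
```

Note that "bread wheat is good for health." also contains the literal token "wheat" once it is analyzed without synonyms, so in this sketch document 3 gets the boost as well; only document 2 relies on the synonym match alone.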

Synonyms can be tricky to set up correctly. We will soon launch an on-demand training course about synonyms that covers topics like this.

Bilal · October 16, 2018, 9:51am

Thank you for the detailed answer. The example you provided worked perfectly, and now I understand why I got these unexpected results.

In the link you provided (Relevance Is Broken!), it says:

In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.

That explains a lot! If I have a large dataset, can your suggestions still be applied? I mean, can I increase the number of shards and index my data twice without problems? Is there anything I should take into account?

abdon · October 16, 2018, 10:16am

Yes, all of this should scale to however much data you have, and however many shards you have. I was just pointing out that with just three documents it does not make sense to have more than one shard.

softwaredoug · November 10, 2018, 2:02pm

Just one note: this is a good default, but it's not always a good idea to assume that synonyms should be scored as equivalent. A set of practices has built up in the search relevance community around using synonyms for a variety of purposes, including expanding to broader terms that people want to score lower.

I've put together a PR to make this behavior configurable in match / multi_match queries (similar to a Solr patch accepted a while back, which makes this behavior configurable).

github.com/elastic/elasticsearch

Synonym Query Style (configurable single term syn queries)

elastic:master ← o19s:synonym-query-configurable

opened 07:34PM - 09 Nov 18 UTC

softwaredoug

+217 -11

In Lucene, SynonymQuery became the default behavior for single term synonym queries, which ES 6.0 inherited. While this is a good default, it interferes with other legitimate uses of synonyms. You don't always want to blend document frequencies of the terms. It's common to want to expand a term to a broadening term while maintaining the specificity of the original term (i.e. `jeans => jeans, trousers`). I and others have written about [these techniques](https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/) for implementing hierarchical vocabularies. In these cases blending ends up doing more harm, so this PR makes this behavior configurable.

This PR introduces an option to the match query, `synonym_query_style`, which can have the values `blended` (default), `most_terms`, or `best_terms`.

```
{
  "match": {
    "text": {
      "query": "blue jeans",
      "synonym_query_style": "most_terms"
    }
  }
}
```

With a synonym file `jeans, trousers`, this would turn into a search: `text:jeans text:trousers`. This is basically the pre-6.0 behavior. Using `best_terms` changes the synonym query to a dismax `text:jeans | text:trousers`, which tends to ignore the broader term if the narrower term is present.

One note on this PR:
- tests seem to be failing locally for me for unrelated functionality (transport client stuff...)
- any feedback on where to place a test is greatly appreciated. I looked for how other match query options were tested and did not find anything.
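
The difference between the three styles can be sketched in Python. This is a simplified model under stated assumptions, not the PR's implementation: the blended case assumes the synonyms are scored as one pseudo-term whose term frequency is the sum over the synonyms and whose document frequency is the largest of theirs, field-length normalization is ignored, and the corpus statistics and function names are hypothetical.

```python
import math

DOC_COUNT = 1000                           # assumed corpus size
DOC_FREQ = {"jeans": 20, "trousers": 200}  # assumed: "trousers" is the broader, more common term


def idf(doc_freq):
    return math.log(1 + (DOC_COUNT - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25(tf, idf_value, k1=1.2):
    # BM25 without length normalization (b = 0), to keep the sketch small.
    return idf_value * tf * (k1 + 1) / (tf + k1)


def per_term_scores(tfs):
    return [bm25(tf, idf(DOC_FREQ[term])) for term, tf in tfs.items() if tf > 0]


def blended(tfs):
    # One pseudo-term: summed term frequency, largest document frequency.
    tf = sum(tfs.values())
    return bm25(tf, idf(max(DOC_FREQ[term] for term in tfs))) if tf > 0 else 0.0


def most_terms(tfs):
    # Boolean OR of the synonyms: matching term scores are added.
    return sum(per_term_scores(tfs))


def best_terms(tfs):
    # Dismax over the synonyms: only the best matching term counts.
    return max(per_term_scores(tfs), default=0.0)


jeans_only = {"jeans": 1, "trousers": 0}
trousers_only = {"jeans": 0, "trousers": 1}
both = {"jeans": 1, "trousers": 1}

for name, style in [("blended", blended), ("most_terms", most_terms), ("best_terms", best_terms)]:
    print(name, style(jeans_only), style(trousers_only), style(both))
```

In this model the blended style scores a "jeans" document and a "trousers" document identically, while the other two styles keep the rarer, more specific term ahead of the broader one.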

system · December 8, 2018, 2:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
