Synonyms result scoring

Great questions!

  1. Synonyms score the same as an exact match. There is no automatic preference for exact matches.
  2. The length of a field is an important factor in the score: a match in a shorter field scores higher than the same match in a longer field. That is what you are seeing here: the documents with a shorter field (6 terms) score higher than the document with a longer field (18 terms).
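
To make the length effect concrete, here is a minimal Python sketch of the term-frequency part of the classic BM25 formula. It assumes your index uses the default BM25 similarity with its default parameters (k1 = 1.2, b = 0.75). The average field length of 10 terms is a made-up value for illustration, and Lucene stores field lengths in a lossy encoding, so the numbers in a real explain output will differ somewhat.

```python
def bm25_tf_norm(tf, field_len, avg_field_len, k1=1.2, b=0.75):
    """Term-frequency component of classic BM25, including field-length normalization."""
    length_ratio = field_len / avg_field_len
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * length_ratio))


# Hypothetical average field length across the index (not taken from your data).
AVG_FIELD_LEN = 10

# The term "wheat" occurs once in both fields; only the field length differs.
short_field = bm25_tf_norm(tf=1, field_len=6, avg_field_len=AVG_FIELD_LEN)
long_field = bm25_tf_norm(tf=1, field_len=18, avg_field_len=AVG_FIELD_LEN)

print(short_field > long_field)  # True: the 6-term field outscores the 18-term field
```

Whatever the real average length is, the same single occurrence always contributes more in the shorter field, because the field length only appears in the denominator.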

If you want to understand how a score is calculated, you can add "explain": true to a search request, and Elasticsearch will tell you exactly how the score for each hit was calculated:

GET wheat_syn/wheat/_search
{
  "explain": true, 
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}
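
With explain enabled, each hit in the response carries an _explanation object made of a value, a description, and a nested list of details. If the raw JSON is hard to read, a small helper like the following Python sketch can flatten it into an indented outline. The sample dictionary is hypothetical and heavily shortened; the values and descriptions in it are invented and are not what Elasticsearch would return for your index.

```python
def format_explanation(node, depth=0):
    """Return the explanation tree as a list of indented 'value  description' lines."""
    lines = ["  " * depth + f"{node['value']}  {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Hypothetical, shortened stand-in for hit["_explanation"] from a search response.
sample = {
    "value": 0.5,
    "description": "score, product of:",
    "details": [
        {"value": 2.0, "description": "idf", "details": []},
        {"value": 0.25, "description": "tfNorm", "details": []},
    ],
}

for line in format_explanation(sample):
    print(line)
```

In practice you would pass it the _explanation of each hit and compare the outlines of two documents side by side to see which factor makes the difference.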

Some suggestions: if you are looking at scores for a small dataset like this, consider creating the index with one shard (instead of the default 5). Otherwise the scores may be surprising, because by default term statistics such as document frequency are computed per shard rather than across the whole index.
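
As a rough illustration of the shard effect, the Python sketch below computes the BM25 inverse document frequency for the same term on two shards. It assumes the default BM25 similarity; the document counts are invented and only stand in for a handful of documents that happened to be spread unevenly across shards.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25 inverse document frequency, computed from one shard's local statistics."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical distribution of a tiny dataset over two shards.
shard_a = bm25_idf(doc_count=1, doc_freq=1)  # shard holds 1 doc, and it contains the term
shard_b = bm25_idf(doc_count=3, doc_freq=1)  # shard holds 3 docs, 1 contains the term

print(shard_a, shard_b)
```

The same term is weighted differently depending on which shard a document landed on, which is why two near-identical documents can end up with different scores. With a single shard, every document is scored against the same statistics.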

Also, consider indexing the data twice: once with synonyms and once without synonyms. You can do that by using multi-fields. Your index creation command would become:

PUT wheat_syn
{
  "mappings": {
    "wheat": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "synonyms": {
              "type": "text",
              "analyzer": "syn_text"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autophrase_syn": {
          "type": "synonym",
          "synonyms": [
            "triticum aestivum => triticum_aestivum",
            "bread wheat => bread_wheat"
          ]
        },
        "wheat_syn": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": [
            "triticum_aestivum, bread_wheat, wheat"
          ]
        }
      },
      "analyzer": {
        "syn_text": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autophrase_syn",
            "wheat_syn"
          ]
        }
      }
    }
  }
}

Now, you can use a bool query to search simultaneously with and without synonyms:

GET wheat_syn/wheat/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "description.synonyms": "wheat"
          }
        }
      ],
      "should": [
        {
          "match": {
            "description": {
              "query": "wheat"
            }
          }
        }
      ]
    }
  }
}

Only the documents that contain the exact term match the should clause (which does not use synonyms). The score of a matching should clause is added to the score of the must clause, so those documents get a higher score and rank at the top of the results.
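
The effect of combining the two clauses can be sketched in a few lines of Python. This is a simplified model that only adds up clause scores and ignores every other factor; the document names and the per-clause scores are hypothetical.

```python
def bool_score(must_score, should_score=None):
    """A matching should clause adds its score; a non-matching one adds nothing."""
    return must_score + (should_score or 0.0)


# Hypothetical per-clause scores. Every document matches the must clause on
# description.synonyms; only one also matches the should clause on description.
docs = {
    "exact_wheat": bool_score(must_score=0.8, should_score=0.6),
    "bread_wheat_only": bool_score(must_score=0.8),
    "triticum_only": bool_score(must_score=0.8),
}

ranking = sorted(docs, key=docs.get, reverse=True)
print(ranking[0])  # exact_wheat
```

If the exact matches should stand out even more, the match query in the should clause accepts a boost parameter that scales its contribution.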

Synonyms can be tricky to set up correctly. We will soon launch an on-demand training course about synonyms that covers topics like this.
