Synonyms result scoring


(Bilal) #1

I'm reading this article about Patterns for Synonyms in Elasticsearch, and I have some questions about the results I got. Here are the mappings and settings I used:

PUT wheat_syn
{
  "mappings": {
    "wheat": {
      "properties": {
        "description": {
          "type": "text",
          "analyzer": "syn_text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "autophrase_syn": {
          "type": "synonym",
          "synonyms": ["triticum aestivum => triticum_aestivum",
                       "bread wheat => bread_wheat"]
        },
        "wheat_syn": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": ["triticum_aestivum, bread_wheat, wheat"]
        }
      },
      "analyzer": {
        "syn_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "autophrase_syn", "wheat_syn"]
        }
      }
    }
  }
}

The Documents:

PUT wheat_syn/wheat/_bulk
{ "index" : { "_id" : "1" } }
{ "description": "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food." }
{ "index" : { "_id" : "2" } }
{ "description": "The scientific name is Triticum aestivum." }
{ "index" : { "_id" : "3" } }
{ "description": "bread wheat is good for health." }

The query:

GET wheat_syn/wheat/_search
{
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}

After executing the query, I got the following result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.48155478,
    "hits": [
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "2",
        "_score": 0.48155478,
        "_source": {
          "description": "The scientific name is Triticum aestivum."
        }
      },
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "3",
        "_score": 0.48155478,
        "_source": {
          "description": "bread wheat is good for health."
        }
      },
      {
        "_index": "wheat_syn",
        "_type": "wheat",
        "_id": "1",
        "_score": 0.46197122,
        "_source": {
          "description": "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food."
        }
      }
    ]
  }
}

Now, my questions are:

  1. I expected the sentence "Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food." to come first, since it contains the exact word the user searched for. Why isn't that the case?
  2. Why is there a difference in score? If it depends on the position of the queried sentence/word in the description field, shouldn't the third result be first? (This is a simple example; the more documents I add, the larger the score difference becomes.)

Thank you!


(Abdon Pijpelink) #2

Great questions!

  1. Synonyms score the same as an exact match. There is no automatic preference for exact matches.
  2. The length of a field is an important factor that determines the score: shorter fields score higher than longer fields. What you're seeing is that the documents with a shorter field (6 terms) score higher than the document with a longer field (18 terms).
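
To make the field-length effect concrete: assuming the index uses the default BM25 similarity (nothing in this thread says otherwise), the contribution of one matching term in Elasticsearch 6.x has roughly the shape below. The symbols are illustrative names, not taken from the posts: f is the term frequency in the field, |d| the field length in terms, avgdl the average field length, N the number of documents that have the field, n the number of those containing the term, and k_1 = 1.2 and b = 0.75 are the default parameters.

```latex
\mathrm{score}(t, d) = \mathrm{idf}(t) \cdot
  \frac{f \,(k_1 + 1)}{f + k_1 \left(1 - b + b \,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n + 0.5}{n + 0.5}\right)
```

With everything else equal, a larger |d| makes the denominator larger, so the longer field scores lower. Note that N, n and avgdl are computed per shard by default.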

If you want to understand how a score is calculated, you can add "explain": true to a search request, and Elasticsearch will tell you exactly how the score for each hit was calculated:

GET wheat_syn/wheat/_search
{
  "explain": true, 
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}

Some suggestions: if you are looking at scores for a small dataset like this, consider creating the index with one shard (instead of the default 5). Otherwise each shard computes its term statistics (IDF) from only the documents it holds, and the scoring may be unexpected, as explained here.
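
If recreating the index is inconvenient, a commonly suggested alternative for small test datasets is the dfs_query_then_fetch search type, which gathers term statistics from all shards before scoring. This is a sketch rather than part of the suggestion above; it adds an extra round trip, so it is usually kept for testing rather than production.

```console
# Same query as before, but scored with index-wide term statistics
GET wheat_syn/wheat/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "description": "wheat"
    }
  }
}
```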

Also, consider indexing the data twice: once with synonyms and once without synonyms. You can do that by using multi-fields. Your index creation command would become:

PUT wheat_syn
{
  "mappings": {
    "wheat": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "synonyms": {
              "type": "text",
              "analyzer": "syn_text"
            },
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autophrase_syn": {
          "type": "synonym",
          "synonyms": [
            "triticum aestivum => triticum_aestivum",
            "bread wheat => bread_wheat"
          ]
        },
        "wheat_syn": {
          "type": "synonym",
          "tokenizer": "keyword",
          "synonyms": [
            "triticum_aestivum, bread_wheat, wheat"
          ]
        }
      },
      "analyzer": {
        "syn_text": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autophrase_syn",
            "wheat_syn"
          ]
        }
      }
    }
  }
}

Now, you can use a bool query to search simultaneously with and without synonyms:

GET wheat_syn/wheat/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "description.synonyms": "wheat"
          }
        }
      ],
      "should": [
        {
          "match": {
            "description": {
              "query": "wheat"
            }
          }
        }
      ]
    }
  }
}

Those documents that contain the exact term are the only ones that match the should clause (which does not use synonyms). As a result, those docs will get a higher score and rank at the top of the results.
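
To check this for a given document, the _analyze API can show which terms each field produces. The two requests below are a sketch, assuming the index was created with the multi-field mapping above; the text is document 2 from the original post.

```console
# Plain field: default standard analyzer, no synonyms applied
GET wheat_syn/_analyze
{
  "field": "description",
  "text": "The scientific name is Triticum aestivum."
}

# Synonym sub-field: uses the syn_text analyzer
GET wheat_syn/_analyze
{
  "field": "description.synonyms",
  "text": "The scientific name is Triticum aestivum."
}
```

The first response should contain no wheat token, while the second should include wheat alongside triticum_aestivum and bread_wheat. That is why document 2 matches only the must clause.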

Synonyms can be tricky to set up correctly. We will soon launch an on-demand training course about synonyms that covers topics like this.


(Bilal) #3

Thank you for the detailed answer. The example you provided worked perfectly, and now I understand why I got these unexpected results.

In the link you provided (Relevance is Broken!), it says:

In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.

That explains a lot! So if I have a large dataset, can your suggestions still be applied? I mean, can I increase the number of shards and index my data twice without problems? Is there anything I should take into account?


(Abdon Pijpelink) #4

Yes, all of this should scale to however much data and however many shards you have. I was just pointing out that with only three documents it does not make sense to have more than one shard. :)


(Doug Turnbull) #5

Just one note: this is a good default, but it's not always a good idea to assume that synonyms should be scored as equivalent. A set of practices has built up in the search relevance community around using synonyms for a variety of purposes, including expansion to broadening terms, which people want to score lower.

I've put together a PR to make this behavior configurable in match / multi_match queries (similar to a Solr patch accepted a while back, which makes this behavior configurable).
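
Until something like that is available, one way to approximate ranking exact matches above synonym matches is to reuse the multi-field mapping from earlier in the thread and boost the clause on the non-synonym field. This is a sketch only: the boost value of 2 is an arbitrary assumption and would need tuning against real data.

```console
GET wheat_syn/wheat/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description.synonyms": "wheat" } }
      ],
      "should": [
        { "match": { "description": { "query": "wheat", "boost": 2 } } }
      ]
    }
  }
}
```

Documents that match only through a synonym still appear, but the boosted should clause widens the gap in favor of documents containing the literal query term.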

