Multi-word synonyms do not affect score in query


#1

Hello,
ES newbie here, looking for help in understanding what's wrong.

Let's consider this index mapping, where I define some synonyms for motorbike models:

{
  "settings": {
    "analysis": {
      "char_filter": {
        "replace": {
          "type": "mapping",
          "mappings": [
            "&=> and "
          ]
        }
      },
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter",
          "split_on_numerics": "false",
          "split_on_case_change": "true",
          "generate_word_parts": "true",
          "generate_number_parts": "true",
          "catenate_all": "true",
          "preserve_original": "true",
          "catenate_numbers": "true"
        },
        "custom_synonym": {
          "type": "synonym",
          "lenient": "true",
          "synonyms": [
            "r 1200 r , r1200 r, r 1200r, r1200r",
            "r 1150 r, r1150 r, r 1150r, r 1150 r, r1150r"
          ]
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": [
            "html_strip",
            "replace"
          ],
          "tokenizer": "whitespace",
          "filter": [
            "custom_synonym",
            "lowercase",
            "word_delimiter"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "properties": {
        "pname": {
          "type": "text",
          "analyzer": "default"
        }
      }
    }
  }
}

If I put two documents in the index:

PUT test_index/product/1
{
  "pname" : "MOTORBIKE BMW R 1150 R"
}


PUT test_index/product/2
{
  "pname" : "MOTORBIKE BMW R 1200 R"
}

And then perform a match query like:

GET test_index/_search
{
    "query": {
        "match" : {
            "pname" : "MOTORBIKE R1200R"
        }
    }
}

I get both hits with the same score:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "product",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "pname" : "MOTORBIKE BMW R 1200 R"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "product",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "pname" : "MOTORBIKE BMW R 1150 R"
        }
      }
    ]
  }
}

My expectation was that the "MOTORBIKE BMW R 1200 R" document would score higher, since I have defined a synonym for the "r1200r" term: (r 1200 r, r1200 r, r 1200r, r1200r).

Any clue?


(Christoph) #2

My guess is that because you defined the synonyms in lowercase, you need to put the synonym filter after the lowercase filter in your analyzer; otherwise the uppercase tokens don't match. Have you tried that already?
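
In terms of the settings from the first post, that suggestion amounts to swapping the first two entries of the analyzer's `filter` list. A sketch of just the `analyzer` part of the index settings, with everything else left unchanged (this is one reading of the suggestion, not a tested configuration):

```json
{
  "analyzer": {
    "default": {
      "type": "custom",
      "char_filter": ["html_strip", "replace"],
      "tokenizer": "whitespace",
      "filter": ["lowercase", "custom_synonym", "word_delimiter"]
    }
  }
}
```

With this order, tokens such as `R1200R` are lowercased before the synonym rules see them.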


#3

Hello,
I put the lowercase filter in front of custom_synonym in my default analyzer to make sure my synonyms are analyzed in lowercase.
I ran the same query and, surprisingly, the non-synonym result now gets the higher score:

{
  "took" : 58,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 2.0824852,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "product",
        "_id" : "1",
        "_score" : 2.0824852,
        "_source" : {
          "pname" : "MOTORBIKE BMW R 1150 R"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "product",
        "_id" : "2",
        "_score" : 2.0001311,
        "_source" : {
          "pname" : "MOTORBIKE R 1200 R"
        }
      }
    ]
  }
}

Now I am more puzzled than before...
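
One way to see where the two scores come from is to rerun the search with `explain` enabled, which adds a per-hit breakdown of the score to the response. A minimal sketch, meant as the body of `GET test_index/_search`:

```json
{
  "explain": true,
  "query": {
    "match": {
      "pname": "MOTORBIKE R1200R"
    }
  }
}
```

With only two documents spread over five shards, term statistics are computed per shard by default, so small score differences may come from that rather than from the synonyms.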

**EDIT #1**

I did some more tests and figured out that my synonyms should be treated as keywords at index time, so that e.g. "r 1200 r" becomes a single token that is a synonym of "r1200r" instead of three tokens "r", "1200", "r". I made a dedicated analyzer for this, and it works well when I check the synonym terms (is this the right way?). However, I'm struggling to get a match query for "r1200r" (which should be expanded to its synonym "r 1200 r") to work against a complete document field like "MY MOTORBIKE IS A R 1200 R". I always get inconsistent results or wrong scores...
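
A check like the one described can be run through the `_analyze` API. A minimal sketch, assuming the index is still `test_index` and using the `default` analyzer from the first post (the dedicated analyzer mentioned here is not shown in the thread), meant as the body of `GET test_index/_analyze`:

```json
{
  "analyzer": "default",
  "text": "MY MOTORBIKE IS A R 1200 R"
}
```

The response lists each emitted token with its position, which shows whether "r 1200 r" comes out as one token or as three.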


(Christoph) #4

Multi-word synonyms can get a little tricky, especially when you add as many equivalent variants as you do. Make sure to read this section in the Elasticsearch Guide about some of the subtleties to consider, and also take a look at the new Synonym Graph Token Filter, which might make your synonym use case easier. Your mileage may vary.
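
As a rough illustration of the Synonym Graph Token Filter, here is a sketch of the `analysis` part of the index settings. The names `custom_synonym_graph` and `search_synonyms` are made up for this example, and the setup has not been tested against the documents in this thread. The filter is designed to be used in a search analyzer only, so the analyzer would be referenced from the field's `search_analyzer` rather than its index-time `analyzer`:

```json
{
  "analysis": {
    "filter": {
      "custom_synonym_graph": {
        "type": "synonym_graph",
        "synonyms": [
          "r 1200 r, r1200 r, r 1200r, r1200r",
          "r 1150 r, r1150 r, r 1150r, r1150r"
        ]
      }
    },
    "analyzer": {
      "search_synonyms": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["lowercase", "custom_synonym_graph"]
      }
    }
  }
}
```

In the mapping, `pname` would then carry `"search_analyzer": "search_synonyms"` next to its existing `"analyzer"` setting.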


#5

@cbuescher Thank you for pointing this out. Using that filter in conjunction with a keyword analyzer did the trick and improved search results dramatically!