Synonyms relevance help

sagarpatel · November 25, 2021, 12:35pm

HI,

I am trying to configured synonyms in Elasticsearch and done the sample configuration as well. But not getting expected relevancy when i am searching data.
Below is index Mapping configuration:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Below is sample data which i have indexed:

POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }

Below is query which i am trying:

GET test_index/_search
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
         "analyzer": "my_search_analyzer"
      }
    }
  }
}

Current Result:

 "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.4100728,
        "_source" : {
          "my_field" : "I had a storm in my brain"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.90928507,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      }
    ]

I am expecting document which is matching exect with query on top and document which is matching with synonyms should come with low score.
so here my expectation is document with value "This is a brainstorm" should come at potion one.

Could you please suggest me how i can achive.

I have tried to applied boosting and weightage as well but no luck.

Thanks in advance !!!

DineshNaik · November 26, 2021, 12:00pm

@sagarpatel I would suggest you to use below API to understand how the score is being calculated for your documents. Understanding tf and idf is very important for relevancy use cases.

GET /test_index/_explain/<your document id>
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
         "analyzer": "my_search_analyzer"
      }
    }
  }
}

for ex:

GET /test_index/_explain/2
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
         "analyzer": "my_search_analyzer"
      }
    }
  }
}

the reason for documents 2 and 4 to get more score is match for the term storm apart from synonym which you should be able to see in the output of _explain api.

sagarpatel · November 26, 2021, 12:20pm

@DineshNaik Thanks for response. I have already tried this API and check the response. here, i am looking for help to boost document which is matching with original query.
so my expectation is document which is matching with original query should come on top and document which is matching with synonyms should come lower with low score.

could you plese help me if there is any way to achive this usecase.

can.ozdemir · November 26, 2021, 1:25pm

Hello there,

Since you are indexing synonyms into your inverted index, brain storm and brainstorm are all different tokens after analyzer does its thing. So Elasticsearch on query time uses your analyzer to create tokens for brain, storm and brainstorm from your query and match multiple tokens with indexes 2 and 4, your index 2 has lesser words so tf/idf scores it higher between the two and index number 1 only matches brainstorm.

You can also see what your analyzer does to your input with this;

POST test_index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "I had a storm in my brain"
}

I did some trying out so, you should change your index analyzer to my_analyzer;

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Then you want to boost your exact matches, but you also want to get hits from my_search_analyzer tokens as well so i have changed your query a bit;

GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "brainstorm",
              "analyzer": "my_search_analyzer"
            }
          }
        },
        {
          "match_phrase": {
            "my_field": {
              "query": "brainstorm"
            }
          }
        }
      ]
    }
  }
}

result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.3491273,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3491273,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.4100728,
        "_source" : {
          "my_field" : "I had a storm in my brain"
        }
      }
    ]
  }
}

sagarpatel · November 26, 2021, 2:09pm

@can.ozdemir Thanks for you reply. it really helped me out.
could you please let me know if there is any performance issue with using multipule clause for same query.

can.ozdemir · November 26, 2021, 2:13pm

This is a boolean query with two full-text queries, which means there are two full-text queries now but I don't think it will effect your performance a lot.

sagarpatel · November 29, 2021, 5:38am

@can.ozdemir Thanks for your reply. I have marked your answer as solution as it help me for my requirements.

system · December 27, 2021, 5:38am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Synonyms not getting applied Elasticsearch	2	374	February 13, 2020
Synonyms result scoring Elasticsearch	5	3487	December 8, 2018
I have got a little Problem with my synonym filter Elasticsearch	5	565	July 6, 2017
Query with synonym doesn't work as expected Elasticsearch	6	2522	July 5, 2017
Using synonym_graph means non-synonyms are not found Elasticsearch	9	272	March 22, 2023

Synonyms relevance help

Related topics