Synonyms relevance help

Hi,

I am trying to configure synonyms in Elasticsearch and have set up a sample configuration, but I am not getting the expected relevance when I search.
Below are the index settings and mapping:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Below is the sample data I have indexed:

POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"my_field": "This is a brainstorm" }
{ "index" : { "_id" : "2" } }
{"my_field": "A different brain storm" }
{ "index" : { "_id" : "3" } }
{"my_field": "About brainstorming" }
{ "index" : { "_id" : "4" } }
{"my_field": "I had a storm in my brain" }
{ "index" : { "_id" : "5" } }
{"my_field": "I envisaged something like that" }

Below is the query I am trying:

GET test_index/_search
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
         "analyzer": "my_search_analyzer"
      }
    }
  }
}

Current Result:

 "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.4100728,
        "_source" : {
          "my_field" : "I had a storm in my brain"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.90928507,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      }
    ]

I expect the document that matches the query exactly to come first, and documents that match only through synonyms to come after it with a lower score.
So here the document with the value "This is a brainstorm" should be in position one.

Could you please suggest how I can achieve this?

I have also tried boosting and weighting, but with no luck.

Thanks in advance!

@sagarpatel I would suggest using the API below to understand how the score is calculated for your documents. Understanding tf and idf is very important for relevance use cases.

GET /test_index/_explain/<your document id>
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
         "analyzer": "my_search_analyzer"
      }
    }
  }
}

For example:

GET /test_index/_explain/2
{
  "query": {
    "match": {
      "my_field": {
        "query": "brainstorm",
         "analyzer": "my_search_analyzer"
      }
    }
  }
}

Documents 2 and 4 score higher because, in addition to the synonym match, they also match the term storm. You should be able to see this in the output of the _explain API.
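The _explain response nests its scoring breakdown under an "explanation" object whose nodes carry "value", "description" and "details". As a rough aid for reading it, here is a small Python sketch that flattens such a tree into indented lines. The helper name flatten_explanation and the sample tree are made up for illustration; the sample numbers and descriptions are placeholders, not real output from this index.

```python
def flatten_explanation(node, depth=0):
    """Yield (depth, value, description) for every node in an explanation tree."""
    yield depth, node["value"], node["description"]
    for child in node.get("details", []):
        yield from flatten_explanation(child, depth + 1)


# Invented sample with placeholder values, NOT real output from test_index.
sample = {
    "value": 1.5,
    "description": "sum of:",
    "details": [
        {"value": 1.0, "description": "placeholder: score for term 'brain'", "details": []},
        {"value": 0.5, "description": "placeholder: score for term 'storm'", "details": []},
    ],
}

rows = list(flatten_explanation(sample))
for depth, value, description in rows:
    # Indent each node by its depth so the per-term contributions line up.
    print("  " * depth + f"{value} {description}")
```

With a real response you would pass response["explanation"] instead of the sample, and the per-term lines show which terms each document actually matched.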

@DineshNaik Thanks for the response. I have already tried this API and checked its output. What I am looking for is a way to boost the document that matches the original query.
So my expectation is that the document matching the original query comes first and documents matching only through synonyms come lower, with a lower score.

Could you please help me if there is any way to achieve this use case?

Hello there,

With your synonym filter applied, brain, storm and brainstorm are all different tokens after the analyzer does its thing. At query time Elasticsearch uses your search analyzer to create the tokens brain, storm and brainstorm from your query. Documents 2 and 4 match more than one of those tokens (brain and storm), and document 2 has fewer words, so tf/idf scores it higher of the two. Document 1 matches only brainstorm.

You can also see what your analyzer does to your input with this:

POST test_index/_analyze
{
  "analyzer": "my_search_analyzer",
  "text": "I had a storm in my brain"
}
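To make the token overlap concrete without a running cluster, here is a rough Python sketch. It is only a simplified stand-in for my_search_analyzer (regex tokenization, lowercase, and the two synonym rules from the thread), not what Elasticsearch actually executes, and the function names are made up for illustration. It ignores token positions and scoring and only shows which query terms each sample document contains.

```python
import re

# The two synonym rules from the index settings, as groups of equivalent entries.
SYNONYM_RULES = [["mind", "brain"], ["brainstorm", "brain storm"]]

DOCS = {
    "1": "This is a brainstorm",
    "2": "A different brain storm",
    "3": "About brainstorming",
    "4": "I had a storm in my brain",
    "5": "I envisaged something like that",
}


def tokenize(text):
    """Rough approximation of the standard tokenizer plus lowercase filter."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase):
    """True if the list `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def expand_query(text):
    """Return the query terms after applying the synonym rules once (no chaining)."""
    tokens = tokenize(text)
    terms = set(tokens)
    for group in SYNONYM_RULES:
        entries = [tokenize(entry) for entry in group]
        if any(contains_phrase(tokens, entry) for entry in entries):
            for entry in entries:
                terms.update(entry)
    return terms


query_terms = expand_query("brainstorm")
matches = {}
for doc_id, text in DOCS.items():
    # Documents are analyzed without synonyms, as with my_analyzer.
    overlap = sorted(query_terms & set(tokenize(text)))
    if overlap:
        matches[doc_id] = overlap

print(sorted(query_terms))
print(matches)
```

In this sketch documents 2 and 4 each contain two of the expanded terms while document 1 contains only one, which is consistent with the ordering in the original result.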

After some experimenting, I suggest using my_analyzer as your index analyzer:

PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "mind, brain",
              "brainstorm,brain storm"
            ]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase"
            ]
          },
          "my_search_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_synonyms"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Next, you want to boost your exact matches but still get hits from the my_search_analyzer tokens, so I have changed your query a bit:

GET test_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "brainstorm",
              "analyzer": "my_search_analyzer"
            }
          }
        },
        {
          "match_phrase": {
            "my_field": {
              "query": "brainstorm"
            }
          }
        }
      ]
    }
  }
}
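If the exact-match clause ever needs more weight than it gets by default, one option is the standard boost parameter on the match_phrase clause. Below is a minimal Python sketch that only builds the request body as a dict; it does not talk to a cluster. The helper name build_query and the boost value 2.0 are assumptions for illustration, not something tested in this thread.

```python
import json


def build_query(text, field="my_field", exact_boost=2.0):
    """Build a bool/should body: a synonym-expanded match plus a boosted exact phrase."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Broad clause: query text analyzed with the synonym-aware analyzer.
                    {"match": {field: {"query": text, "analyzer": "my_search_analyzer"}}},
                    # Exact clause: uses the field's own analyzer (no synonyms);
                    # exact_boost is a hypothetical tuning value.
                    {"match_phrase": {field: {"query": text, "boost": exact_boost}}},
                ]
            }
        }
    }


body = build_query("brainstorm")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into Console as the body of GET test_index/_search. Setting exact_boost=1.0 gives the same scoring as the query above, since 1.0 is the default boost.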

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 2.3491273,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.3491273,
        "_source" : {
          "my_field" : "This is a brainstorm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.8185701,
        "_source" : {
          "my_field" : "A different brain storm"
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.4100728,
        "_source" : {
          "my_field" : "I had a storm in my brain"
        }
      }
    ]
  }
}
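A bool query adds up the scores of the should clauses that match, so the difference between the two result sets shows what the match_phrase clause contributed. The quick Python check below uses only the scores posted in this thread; it assumes the match clause scored each document the same in both runs.

```python
# Scores from the original single match query.
before = {"1": 0.90928507, "2": 1.8185701, "4": 1.4100728}
# Scores from the bool query with the extra match_phrase clause.
after = {"1": 2.3491273, "2": 1.8185701, "4": 1.4100728}

# Whatever was added on top of the match score came from match_phrase.
phrase_contribution = {
    doc_id: round(after[doc_id] - before[doc_id], 7) for doc_id in before
}
ranking = sorted(after, key=after.get, reverse=True)

print(phrase_contribution)
print(ranking)
```

Only document 1 gains anything (about 1.44), because it is the only one that contains the exact token brainstorm. Documents 2 and 4 keep their earlier scores, which is why document 1 moves to the top.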

@can.ozdemir Thanks for your reply. It really helped me out.
Could you please let me know if there is any performance issue with using multiple clauses for the same query?

This is a bool query with two full-text clauses, so two full-text queries now run instead of one, but I don't think it will affect your performance much.

@can.ozdemir Thanks for your reply. I have marked your answer as the solution, as it meets my requirements.
