PerFieldSimilarity for synonym expansion at query time


(Guilherme Maranhao) #1

Hi everybody,

The guide says here (https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html) that using simple synonym expansion at query time is an advantage for relevance. Please consider my scenario:

My simple expansion synonym rule is:

shirt, blouse

The index structure is

{
	"my_index": {
		"aliases": {},
		"mappings": {
			"my_type": {
				"properties": {
					"name": {
						"type": "text",
						"fields": {
							"keyword": {
								"type": "keyword",
								"ignore_above": 256
							}
						}
					}
				}
			}
		},
		"settings": {
			"index": {
				"analysis": {
					"filter": {
						"brazilian_stop": {
							"type": "stop",
							"stopwords": "_brazilian_"
						},
						"synonym_filter": {
							"type": "synonym",
							"synonyms_path": "sinonimos.txt"
						}
					},
					"analyzer": {
						"synonym_brazilian_analyzer": {
							"filter": [
								"lowercase",
								"asciifolding",
								"synonym_filter",
								"brazilian_stop"
							],
							"tokenizer": "standard"
						}
					}
				}
			}
		}
	}
}

Note that I'm not applying the synonym analyzer at index time.
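
To make the query-time behaviour concrete, here is a rough Python sketch of what the analyzer chain above does to a query string. It is only an approximation under stated assumptions: whitespace splitting stands in for the standard tokenizer, SYNONYM_GROUPS is a hypothetical one-rule copy of sinonimos.txt, and STOPWORDS is a tiny made-up subset of the _brazilian_ list.

```python
import unicodedata

# Assumption: a single rule mirroring the "shirt, blouse" line in sinonimos.txt.
SYNONYM_GROUPS = [{"shirt", "blouse"}]
# Assumption: a tiny stand-in for the _brazilian_ stop word list.
STOPWORDS = {"de", "a", "o", "e"}


def ascii_fold(token):
    # Rough stand-in for asciifolding: decompose and drop combining accents.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze_query(text):
    """Return one sorted list of terms per surviving token position."""
    positions = []
    for raw in text.split():  # whitespace split approximates the tokenizer
        token = ascii_fold(raw.lower())
        terms = {token}
        for group in SYNONYM_GROUPS:
            if token in group:
                terms |= group  # simple expansion: add every synonym
        terms -= STOPWORDS
        if terms:
            positions.append(sorted(terms))
    return positions


print(analyze_query("Shirt"))
```

Because the analyzer runs only on the query, the indexed documents keep their original single terms, and a search for "shirt" looks for both "shirt" and "blouse" at the same position.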

And only 3 documents:

_id = 1
{
  "name": "shirt xyz"
}

_id = 2
{
  "name": "blouse xyz"
}

_id = 3
{
  "name": "blouse wvc"
}

My query is

{
	"query": {
		"query_string": {
			"fields": ["name"],
			"query": "shirt",
			"analyzer": "synonym_brazilian_analyzer"
		}
	}
}

If I search for "shirt", applying the synonym analyzer only at query time, shouldn't the _id=1 document have a higher score than the _id=2 one?

I'm asking because, according to the explain output, both have exactly the same score.
My point is: what exactly is the query-time advantage for relevance with simple expansion?
What about the PerFieldSimilarity calculation?
Shouldn't the "shirt xyz" text be more relevant than "blouse xyz" at query time?

Thanks a lot,

Guilherme


(Adrien Grand) #2

I get the same score for all documents when I run the following:

DELETE index

PUT index 
{
  "settings": {
    "number_of_shards": 1, 
    "analysis": {
      "filter": {
        "my_syns": {
          "type": "synonym",
          "synonyms" : [
            "shirt,blouse"
          ]
        }
      },
      "analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace"
        },
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_syns"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "field": {
          "type": "text",
          "analyzer": "my_index_analyzer",
          "search_analyzer": "my_search_analyzer"
        }
      }
    }
  }
}

PUT index/doc/1
{
  "field": "shirt xyz"
}

PUT index/doc/2
{
  "field": "blouse xyz"
}

GET index/_search
{
  "query": {
    "match": {
      "field": "shirt"
    }
  }
}


(Guilherme Maranhao) #3

But is that the expected behavior?


(Adrien Grand) #4

To me it is. We do this on purpose on the Lucene side by using a SynonymQuery, which merges the statistics of the synonyms so that it does not matter which synonym is used.
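
As an illustration only, the Python sketch below scores the three documents from the question with a simplified BM25 formula, once with each synonym scored as a separate term and once with merged statistics. The merging rule used here (sum the term frequencies, take the largest document frequency) is an assumption about how the statistics are combined, and the function names and the k1 and b values are chosen just for this example, so it is not a copy of the Lucene code.

```python
import math

# Toy corpus taken from the question: document id -> tokens.
DOCS = {
    1: ["shirt", "xyz"],
    2: ["blouse", "xyz"],
    3: ["blouse", "wvc"],
}
SYNONYMS = ["shirt", "blouse"]  # the query "shirt" after expansion


def idf(doc_freq, doc_count):
    # BM25-style inverse document frequency.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25(term_freq, doc_freq, doc_len, avg_len, doc_count, k1=1.2, b=0.75):
    # Score of one term (or merged pseudo-term) in one document.
    if term_freq == 0:
        return 0.0
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf(doc_freq, doc_count) * term_freq * (k1 + 1) / (term_freq + norm)


def doc_freq(term):
    # Number of documents containing the term.
    return sum(1 for tokens in DOCS.values() if term in tokens)


def score_all(merged):
    n = len(DOCS)
    avg_len = sum(len(tokens) for tokens in DOCS.values()) / n
    scores = {}
    for doc_id, tokens in DOCS.items():
        if merged:
            # Assumption: the synonyms act as one pseudo-term with summed
            # term frequency and the largest document frequency.
            tf = sum(tokens.count(s) for s in SYNONYMS)
            df = max(doc_freq(s) for s in SYNONYMS)
            scores[doc_id] = bm25(tf, df, len(tokens), avg_len, n)
        else:
            # Each synonym is scored with its own statistics.
            scores[doc_id] = sum(
                bm25(tokens.count(s), doc_freq(s), len(tokens), avg_len, n)
                for s in SYNONYMS
            )
    return scores


merged_scores = score_all(merged=True)
separate_scores = score_all(merged=False)
print("merged:  ", merged_scores)
print("separate:", separate_scores)
```

With merged statistics all three documents receive the same score. With separate statistics the "shirt" document scores higher only because "shirt" is the rarer term in this toy corpus, not because it is the term that was typed.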


(Guilherme Maranhao) #5

OK, but what about the advantage of analyzing at query time mentioned here: https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html ?

According to that page, the IDF for each synonym will be correct.

Thank you!


(Guilherme Maranhao) #6

Hi,

I got it! At query time, the IDF takes into account each of the synonyms across the whole index, not only the specific term that is being searched.
So the documents that contain the most relevant synonym will have the highest scores.

Thank you

