Hi all!
I just ran into an issue with the scoring of multiword synonyms.
When a term expands to a multiword synonym, the underlying scoring model seems to change, which results in unfair scoring for documents that contain the multiword synonym.
I created a small setup to reproduce my issue (bear with me, it's a lot of code):
DELETE test
PUT test
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "usa, america",
            "usa, united states"
          ]
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "synonym_analyzer"
        },
        "description": {
          "type": "text",
          "analyzer": "synonym_analyzer"
        }
      }
    }
  }
}
Validating the following two queries with rewrite=true shows the Lucene queries they are rewritten to:
Query 1:
GET test/_validate/query?rewrite=true
{
  "query": {
    "multi_match": {
      "query": "usa",
      "type": "cross_fields",
      "fields": [
        "title",
        "description"
      ]
    }
  }
}
Query 2:
GET test/_validate/query?rewrite=true
{
  "query": {
    "multi_match": {
      "query": "america",
      "type": "cross_fields",
      "fields": [
        "title",
        "description"
      ]
    }
  }
}
Result 1:
(title:america | description:america) (title:"united states" | description:"united states") (title:usa | description:usa)
Result 2:
(title:usa | title:america | description:usa | description:america)
When I run the queries above with explain enabled, the scores are combined differently:
- In the first query, the scores of the individual query parts are summed.
- In the second query, only the max of the individual scores is used.
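To make the difference concrete, here is a small Python sketch of the two ways the scores get combined. It is only a model of what the rewritten queries and the explain output suggest, not Elasticsearch or Lucene code. The function names and the 0.5 phrase score are made up for illustration; only 0.19856805 is taken from the explain output in this post.

```python
from typing import Dict, List

# Maps a clause such as 'title:usa' to its (hypothetical) score for one document.
FieldScores = Dict[str, float]


def dis_max(field_scores: FieldScores) -> float:
    """Score of one `(a | b | ...)` group: only the best clause counts."""
    return max(field_scores.values(), default=0.0)


def score_sum_of_groups(groups: List[FieldScores]) -> float:
    """Shape of the rewritten `usa` query: one group per synonym, groups summed."""
    return sum(dis_max(group) for group in groups)


def score_single_group(group: FieldScores) -> float:
    """Shape of the rewritten `america` query: all clauses in one group."""
    return dis_max(group)


TERM_SCORE = 0.19856805  # value seen in the explain output of this post
PHRASE_SCORE = 0.5       # assumption: invented score for the phrase clause

# Query "usa": three groups, one per synonym alternative.
usa_groups = [
    {"title:america": TERM_SCORE, "description:america": 0.0},
    {'title:"united states"': 0.0, 'description:"united states"': PHRASE_SCORE},
    {"title:usa": TERM_SCORE, "description:usa": 0.0},
]

# Query "america": a single group containing every term clause.
america_group = {
    "title:usa": TERM_SCORE,
    "title:america": TERM_SCORE,
    "description:usa": 0.0,
    "description:america": 0.0,
}

usa_score = score_sum_of_groups(usa_groups)
america_score = score_single_group(america_group)
print("usa:", usa_score, "america:", america_score)
```

With these assumed inputs the same document scores higher for "usa" than for "america", simply because every matching synonym group adds to the sum instead of competing for the max.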
On a large data set this produces a completely different ranking.
Here is a minimal example that shows the effect:
POST test/test/1
{
  "title": "The USA is a very big country",
  "description": "Book about the states"
}
POST test/test/2
{
  "title": "The USA is a very big country",
  "description": "Movie about the united states"
}
POST test/_refresh
Query:
GET test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "usa",
      "type": "cross_fields",
      "fields": [
        "title",
        "description"
      ]
    }
  }
}
Explain:
"_source": {
  "title": "The USA is a very big country",
  "description": "Movie about the united states"
},
"_explanation": {
  "value": 1.272605,
  "description": "sum of:",
  "details": [
    {
      "value": 0.19856805,
      "description": "max of:",
      "details": [
        {
          "value": 0.19856805,
          "description": "weight(title:america in 1) [PerFieldSimilarity], result of:",
          "details": [
            {
... rest removed
Query:
GET test/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "america",
      "type": "cross_fields",
      "fields": [
        "title",
        "description"
      ]
    }
  }
}
Explain:
"_source": {
  "title": "The USA is a very big country",
  "description": "Book about the states"
},
"_explanation": {
  "value": 0.19856805,
  "description": "max of:",
  "details": [
    {
      "value": 0.19856805,
      "description": "weight(title:usa in 0) [PerFieldSimilarity], result of:",
      "details": [
        {
... rest removed
Is this expected behavior, or do Lucene/Elasticsearch handle multiword synonym expansions inconsistently?
I'm running Elasticsearch 6.4.2.