Span_or not applying higher score when matching multiple unique terms

robmartin11 · March 23, 2024, 9:34am

Normally, when a query contains multiple term clauses, relevance scoring is applied to each term and the per-term scores are summed. This means a document matching several unique terms scores higher than one matching the same term several times (because the k1 parameter in BM25 saturates term frequency).
When using span_term within span_or, it seems all terms are treated as the same term, so the uniqueness of terms makes no difference.
For example:

PUT /people
{
    "mappings": {
        "properties": {
            "name": {
                "type": "text"
            }
        }
    }
}

POST /people/_bulk
{ "index": { "_id": "1" } }
{ "name": "Rob" }
{ "index": { "_id": "2" } }
{ "name": "Rob Rob" }
{ "index": { "_id": "3" } }
{ "name": "Martin Martin" }
{ "index": { "_id": "4" } }
{ "name": "Rob J Martin" }
{ "index": { "_id": "5" } }
{ "name": "Martin" }

GET people/_search
{
    "query": {
        "span_near": {
            "clauses": [
                {
                    "span_or": {
                        "clauses": [
                            {
                                "span_term": {
                                    "name": "rob"
                                }
                            },
                            {
                                "span_term": {
                                    "name": "martin"
                                }
                            }
                        ]
                    }
                }
            ],
            "slop": 3,
            "in_order": false
        }
    }
}

This results in:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.437324,
        "hits": [
            {
                "_index": "people",
                "_id": "2",
                "_score": 1.437324,
                "_source": {
                    "name": "Rob Rob"
                }
            },
            {
                "_index": "people",
                "_id": "3",
                "_score": 1.437324,
                "_source": {
                    "name": "Martin Martin"
                }
            },
            {
                "_index": "people",
                "_id": "1",
                "_score": 1.3175471,
                "_source": {
                    "name": "Rob"
                }
            },
            {
                "_index": "people",
                "_id": "5",
                "_score": 1.3175471,
                "_source": {
                    "name": "Martin"
                }
            },
            {
                "_index": "people",
                "_id": "4",
                "_score": 1.2482024,
                "_source": {
                    "name": "Rob J Martin"
                }
            }
        ]
    }
}

I'd normally expect Rob J Martin to get the highest score, as it contains both terms, but it actually appears last!
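
To make that expectation concrete, here is a minimal Python sketch of per-term BM25 scoring summed across terms, roughly what separate term clauses would do. It is not Elasticsearch code: the whitespace/lowercase tokenisation, the helper names (`idf`, `tf`, `summed_score`) and the single-shard statistics are assumptions for illustration, and the boost of 2.2 is simply the value the explain API reports for this index.

```python
import math

# BM25 defaults; BOOST is the "boost" value the explain API reports here.
K1, B, BOOST = 1.2, 0.75, 2.2

docs = {
    "1": "Rob",
    "2": "Rob Rob",
    "3": "Martin Martin",
    "4": "Rob J Martin",
    "5": "Martin",
}

# Assumption: lowercase + whitespace split stands in for the standard analyzer.
tokens = {doc_id: name.lower().split() for doc_id, name in docs.items()}
N = len(tokens)
avgdl = sum(len(t) for t in tokens.values()) / N  # 1.8 for this data


def idf(term):
    """Inverse document frequency for a single term."""
    n = sum(1 for t in tokens.values() if term in t)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def tf(freq, dl):
    """Saturating term-frequency component with length normalisation."""
    return freq / (freq + K1 * (1 - B + B * dl / avgdl))


def summed_score(doc_id, terms):
    """Score each term separately, then add the per-term scores."""
    toks = tokens[doc_id]
    return sum(
        BOOST * idf(term) * tf(toks.count(term), len(toks))
        for term in terms
        if term in toks
    )


scores = {doc_id: summed_score(doc_id, ["rob", "martin"]) for doc_id in tokens}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Under these assumptions document 4 (Rob J Martin) comes out on top, because it collects two separate per-term scores, while Rob Rob only gets a saturated frequency of 2 for a single term.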

The explain output for that document confirms that Rob and Martin are both counted as occurrences of the same term (phraseFreq=2.0):

{
    "_index": "people",
    "_id": "4",
    "matched": true,
    "explanation": {
        "value": 1.2482024,
        "description": "weight(spanOr([name:rob, name:martin]) in 3) [PerFieldSimilarity], result of:",
        "details": [
            {
                "value": 1.2482024,
                "description": "score(freq=2.0), computed as boost * idf * tf from:",
                "details": [
                    {
                        "value": 2.2,
                        "description": "boost",
                        "details": []
                    },
                    {
                        "value": 1.077993,
                        "description": "idf, sum of:",
                        "details": [
                            {
                                "value": 0.5389965,
                                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                "details": [
                                    {
                                        "value": 3,
                                        "description": "n, number of documents containing term",
                                        "details": []
                                    },
                                    {
                                        "value": 5,
                                        "description": "N, total number of documents with field",
                                        "details": []
                                    }
                                ]
                            },
                            {
                                "value": 0.5389965,
                                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                                "details": [
                                    {
                                        "value": 3,
                                        "description": "n, number of documents containing term",
                                        "details": []
                                    },
                                    {
                                        "value": 5,
                                        "description": "N, total number of documents with field",
                                        "details": []
                                    }
                                ]
                            }
                        ]
                    },
                    {
                        "value": 0.5263158,
                        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                        "details": [
                            {
                                "value": 2.0,
                                "description": "phraseFreq=2.0",
                                "details": []
                            },
                            {
                                "value": 1.2,
                                "description": "k1, term saturation parameter",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "b, length normalization parameter",
                                "details": []
                            },
                            {
                                "value": 3.0,
                                "description": "dl, length of field",
                                "details": []
                            },
                            {
                                "value": 1.8,
                                "description": "avgdl, average length of field",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
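
The arithmetic in the explain output can be checked by hand. The following Python sketch plugs in the values reported above (boost, N, n, k1, b, avgdl) and uses one combined frequency for the whole span_or; the function name `span_or_score` is only illustrative and this is not how Lucene is implemented internally.

```python
import math

# Values copied from the explain output above.
K1, B, BOOST = 1.2, 0.75, 2.2
N, n, AVGDL = 5, 3, 1.8


def span_or_score(freq, dl):
    """Score as reported by explain: summed idf times one shared tf."""
    # idf is summed over both terms (rob and martin each appear in 3 of 5 docs).
    idf_sum = 2 * math.log(1 + (N - n + 0.5) / (n + 0.5))
    # A single tf is computed from the combined span frequency.
    tf = freq / (freq + K1 * (1 - B + B * dl / AVGDL))
    return BOOST * idf_sum * tf


rob_j_martin = span_or_score(freq=2, dl=3)  # doc 4
rob_rob = span_or_score(freq=2, dl=2)       # doc 2 (same for doc 3)
rob = span_or_score(freq=1, dl=1)           # doc 1 (same for doc 5)
print(rob_j_martin, rob_rob, rob)
```

The three results agree with the scores in the search response (about 1.2482, 1.4373 and 1.3175). Since Rob J Martin and Rob Rob share the same frequency of 2, the only difference between them is the field length, which is why the longer document ranks lower.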
