Normally, when a query combines multiple term clauses, relevance scoring is applied to each term and the results are summed, so matches on multiple unique terms score higher than multiple matches on the same term (because of the k1 saturation parameter in BM25).
When using span_term within span_or, however, all terms seem to be treated as the same term, so the uniqueness of the terms makes no difference.
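To make the saturation argument concrete, here is a minimal sketch of the BM25 tf component, using the formula and the k1 and b values that appear in the explain output later in this post. It assumes, purely for illustration, that both terms have the same idf and boost and that the field is of average length, so only the tf part differs.

```python
def bm25_tf(freq, k1=1.2, b=0.75, dl=1.0, avgdl=1.0):
    """BM25 tf component: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Illustrative assumption: equal idf and boost for both terms, dl == avgdl.
two_unique_terms = bm25_tf(1) + bm25_tf(1)  # each term scored on its own, then summed
one_term_twice = bm25_tf(2)                 # a single term with frequency 2

print(f"two unique terms: {two_unique_terms:.4f}")
print(f"one term twice:   {one_term_twice:.4f}")
```

Because tf saturates as freq grows, the second occurrence of the same term adds less than the first occurrence of a different term would.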
For example:
PUT /people
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}
POST /people/_bulk
{ "index": { "_id": "1" } }
{ "name": "Rob" }
{ "index": { "_id": "2" } }
{ "name": "Rob Rob" }
{ "index": { "_id": "3" } }
{ "name": "Martin Martin" }
{ "index": { "_id": "4" } }
{ "name": "Rob J Martin" }
{ "index": { "_id": "5" } }
{ "name": "Martin" }
GET /people/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_or": {
            "clauses": [
              {
                "span_term": {
                  "name": "rob"
                }
              },
              {
                "span_term": {
                  "name": "martin"
                }
              }
            ]
          }
        }
      ],
      "slop": 3,
      "in_order": false
    }
  }
}
This results in:
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 1.437324,
    "hits": [
      {
        "_index": "people",
        "_id": "2",
        "_score": 1.437324,
        "_source": {
          "name": "Rob Rob"
        }
      },
      {
        "_index": "people",
        "_id": "3",
        "_score": 1.437324,
        "_source": {
          "name": "Martin Martin"
        }
      },
      {
        "_index": "people",
        "_id": "1",
        "_score": 1.3175471,
        "_source": {
          "name": "Rob"
        }
      },
      {
        "_index": "people",
        "_id": "5",
        "_score": 1.3175471,
        "_source": {
          "name": "Martin"
        }
      },
      {
        "_index": "people",
        "_id": "4",
        "_score": 1.2482024,
        "_source": {
          "name": "Rob J Martin"
        }
      }
    ]
  }
}
I'd normally expect Rob J Martin to get the highest score, as it contains both terms, but it actually appears last!
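For comparison, the ranking I expected can be sketched by scoring each term separately and summing. This is not Elasticsearch output: it is a hand-rolled model that assumes lowercased whitespace tokens and reuses the BM25 formula, boost, k1 and b values shown in the explain output in this post. The `docs` dictionary and the `per_term_score` helper are made up for this illustration.

```python
import math

K1, B, BOOST = 1.2, 0.75, 2.2  # values taken from the explain output in this post

# Assumed tokenization of the five indexed names (lowercased, split on spaces).
docs = {
    "1": ["rob"],
    "2": ["rob", "rob"],
    "3": ["martin", "martin"],
    "4": ["rob", "j", "martin"],
    "5": ["martin"],
}
avgdl = sum(len(tokens) for tokens in docs.values()) / len(docs)


def idf(term):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5)), n = docs containing the term."""
    n = sum(term in tokens for tokens in docs.values())
    N = len(docs)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def tf(freq, dl):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + K1 * (1 - B + B * dl / avgdl))


def per_term_score(tokens, terms=("rob", "martin")):
    """Score every query term on its own, then sum the per-term scores."""
    return sum(
        BOOST * idf(t) * tf(tokens.count(t), len(tokens))
        for t in terms
        if t in tokens
    )


ranking = sorted(docs, key=lambda d: per_term_score(docs[d]), reverse=True)
for doc_id in ranking:
    print(doc_id, " ".join(docs[doc_id]), round(per_term_score(docs[doc_id]), 4))
```

Under these assumptions, document 4 comes out first, because it collects a separate tf contribution for each of the two terms.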
The explain output for that record confirms that Rob and Martin are treated as the same term (phraseFreq=2.0):
{
  "_index": "people",
  "_id": "4",
  "matched": true,
  "explanation": {
    "value": 1.2482024,
    "description": "weight(spanOr([name:rob, name:martin]) in 3) [PerFieldSimilarity], result of:",
    "details": [
      {
        "value": 1.2482024,
        "description": "score(freq=2.0), computed as boost * idf * tf from:",
        "details": [
          {
            "value": 2.2,
            "description": "boost",
            "details": []
          },
          {
            "value": 1.077993,
            "description": "idf, sum of:",
            "details": [
              {
                "value": 0.5389965,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 3,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 5,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.5389965,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 3,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 5,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              }
            ]
          },
          {
            "value": 0.5263158,
            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details": [
              {
                "value": 2.0,
                "description": "phraseFreq=2.0",
                "details": []
              },
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              {
                "value": 3.0,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 1.8,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}
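The arithmetic in this explain output can be replayed by hand. The sketch below plugs in the reported values (boost, summed idf, k1, b, avgdl) and pools the occurrences of both terms into a single frequency, which is what phraseFreq=2.0 suggests is happening. The `span_or_score` helper and the `reported` table are made up for this illustration; small differences from the reported scores are expected, since Lucene computes in single precision.

```python
import math

K1, B, BOOST = 1.2, 0.75, 2.2  # from the explain output
N, AVGDL = 5, 1.8              # from the explain output

# Each term occurs in n = 3 of the N = 5 documents; the two idf values are summed.
idf_term = math.log(1 + (N - 3 + 0.5) / (3 + 0.5))
idf_sum = 2 * idf_term


def span_or_score(pooled_freq, dl):
    """boost * (sum of idfs) * tf, with tf computed from the pooled frequency."""
    tf = pooled_freq / (pooled_freq + K1 * (1 - B + B * dl / AVGDL))
    return BOOST * idf_sum * tf


# name -> (pooled frequency, field length, score reported by the search response)
reported = {
    "Rob Rob": (2, 2, 1.437324),
    "Rob": (1, 1, 1.3175471),
    "Rob J Martin": (2, 3, 1.2482024),
}
for name, (freq, dl, expected) in reported.items():
    score = span_or_score(freq, dl)
    print(f"{name}: computed={score:.6f} reported={expected}")
```

With the frequencies pooled, "Rob J Martin" has the same freq as "Rob Rob" but a longer field, so length normalization pushes it below every other document.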