My question is similar to an old feature request, "Add match count scoring option" (Issue #13806 in elastic/elasticsearch on GitHub), but some of the options suggested there are already deprecated.
I'm trying to create a custom scoring function that returns the total number of matches in a document, without normalization and without taking shard-level term statistics into account.
Basically, for a document like
"hello hi hello", a search for "hello" should return a score of 2, and so on.
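To pin down the score I'm after, here is a minimal Python sketch (not Elasticsearch code). `match_count` is a hypothetical helper, and lowercased whitespace splitting is only a stand-in for a real analyzer.

```python
def match_count(text: str, term: str) -> int:
    """Count exact occurrences of `term` among whitespace-separated tokens."""
    # Assumption: lowercasing + whitespace split approximates the analyzer.
    tokens = text.lower().split()
    return tokens.count(term.lower())


if __name__ == "__main__":
    # The example from the question: "hello" occurs twice.
    print(match_count("hello hi hello", "hello"))
```

No normalization by field length and no index-wide statistics are involved; the score is just the raw term frequency.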
My best attempt so far uses a custom scripted similarity:
PUT index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_hits": {
        "type": "scripted",
        "script": {
          "source": "return query.boost * doc.freq;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "field": {
        "type": "text",
        "similarity": "scripted_hits"
      }
    }
  }
}
POST index/_bulk
{"index": { "_id": "1" }}
{ "field": "foo bar foo" }
{"index": { "_id": "2" }}
{ "id":2, "field": "bar baz" }
{"index": { "_id": "3" }}
{ "id":3, "field": "foo bar foo bar foo bar" }
{"index": { "_id": "4" }}
{ "id":4, "field": "foo bar foo bar foo bar test test2 test 3 foo test1 test2 test3" }
{"index": { "_id": "5" }}
{ "id":5, "field": "foo2" }
It seems to work well with term searches. span_near doubles the scores, which I can understand (both terms match, after all), and I can fix that by setting boost: 0.5.
There is a big problem with wildcards, though. The term query below returns a max score of 4 (doc id 4, which is correct):
GET index/_search?explain=true
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "field": {
              "value": "foo"
            }
          }
        }
      ]
    }
  }
}
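For reference, these are the scores I expect from the term query on the sample documents, sketched in plain Python. `DOCS` mirrors the bulk data above; `expected_term_scores` is a hypothetical helper that applies the same formula as the similarity script (boost times frequency), with whitespace splitting assumed to be close enough to the analyzer for this data.

```python
# Sample documents from the bulk request, keyed by _id.
DOCS = {
    1: "foo bar foo",
    2: "bar baz",
    3: "foo bar foo bar foo bar",
    4: "foo bar foo bar foo bar test test2 test 3 foo test1 test2 test3",
    5: "foo2",
}


def expected_term_scores(docs, term, boost=1.0):
    """Return {doc_id: boost * freq} for every document containing `term`."""
    scores = {}
    for doc_id, text in docs.items():
        freq = text.split().count(term)  # raw term frequency in this doc
        if freq > 0:  # non-matching documents are not returned at all
            scores[doc_id] = boost * freq
    return scores


if __name__ == "__main__":
    print(expected_term_scores(DOCS, "foo"))
```

Doc 4 comes out on top with 4.0, which matches what the term query returns.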
A wildcard query always returns 1 regardless of the number of matches (the same doc id 4 gets a score of 1 now), which looks like either a bug or an unimplemented feature:
GET index/_search?explain=true
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "field": "foo"
          }
        }
      ]
    }
  }
}
Using span_multi at least enables scoring based on the number of hits, but it doubles the score for "foo*" even though the second expanded term (foo2) has no hits in that record:
GET index/_search?explain=true
{
  "query": {
    "bool": {
      "must": [
        {
          "span_multi": {
            "match": {
              "wildcard": {
                "field": "foo*"
              }
            }
          }
        }
      ]
    }
  }
}
The explanation shows doc.freq=4 for both foo and foo2, even though foo2 does not occur in the document.
The same happens with span_or (which span_multi rewrites the wildcard into anyway):
GET index/_search?explain=true
{
  "query": {
    "bool": {
      "must": [
        {
          "span_or": {
            "clauses": [
              {
                "span_term": {
                  "field": "foo"
                }
              },
              {
                "span_term": {
                  "field": "foo2"
                }
              }
            ]
          }
        }
      ]
    }
  }
}
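To make the numbers concrete, here is a toy Python model of what the explain output appears to show. This is my reading, not confirmed Elasticsearch behavior: I'm assuming the similarity script runs once per clause term and each run sees the frequency of the whole span query rather than of its own term. All function names are hypothetical.

```python
def span_freq(text, terms):
    """Number of token positions in `text` matched by any of `terms`."""
    wanted = set(terms)
    return sum(1 for token in text.split() if token in wanted)


def observed_score(text, terms, boost=1.0):
    """Assumed behavior: one script run per term, each using the span freq."""
    freq = span_freq(text, terms)
    return sum(boost * freq for _ in terms)


def desired_score(text, terms, boost=1.0):
    """What I want instead: each matching position counted exactly once."""
    return boost * span_freq(text, terms)


if __name__ == "__main__":
    doc4 = "foo bar foo bar foo bar test test2 test 3 foo test1 test2 test3"
    print(observed_score(doc4, ["foo", "foo2"]))  # doubled score
    print(desired_score(doc4, ["foo", "foo2"]))   # plain match count
```

If this reading is right, a boost of 1 / (number of clauses) would cancel the duplication for an explicit span_or, but with a wildcard I don't know in advance how many terms it expands to.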
Is this a bug? Is there any workaround, or a way to remove the duplicates for span_or or wildcards?