Custom scoring based on number of matches

My question is somewhat similar to an old feature request, "Add match count scoring option" (Issue #13806 in elastic/elasticsearch on GitHub), but some of the options suggested there are already deprecated.
I'm trying to create a custom scoring function that returns the total number of matches in a document, without normalization and without taking shard-level term statistics into account.
Basically, for a document like
"hello hi hello", a search for "hello" should return a score of 2, and so on.

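To make the target behaviour concrete, here is a minimal Python sketch of the score I am after. It is not Elasticsearch code: the helper name `match_count` is made up for illustration, and it assumes simple lowercase whitespace tokenization, which is only an approximation of what an analyzer does.

```python
def match_count(text: str, term: str) -> int:
    """Raw term frequency: number of whitespace-separated tokens equal to `term`.

    Assumption: lowercase + whitespace splitting stands in for the analyzer.
    """
    wanted = term.lower()
    return sum(1 for token in text.lower().split() if token == wanted)


print(match_count("hello hi hello", "hello"))  # 2
```

No length normalization and no index-wide statistics are involved; the score depends only on the document itself.
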
My best attempt so far was using custom scripted similarity:

PUT index
{
    "settings": {
        "number_of_shards": 1,
        "similarity": {
            "scripted_hits": {
                "type": "scripted",
                "script": {
                    "source": "return query.boost * doc.freq;"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {
                "type": "integer"
            },
            "field": {
                "type": "text",
                "similarity": "scripted_hits"
            }
        }
    }
}

POST index/_bulk
{"index": { "_id": "1" }}
{ "field": "foo bar foo" }
{"index": { "_id": "2" }}
{ "id":2, "field": "bar baz" }
{"index": { "_id": "3" }}
{ "id":3, "field": "foo bar foo bar foo bar" }
{"index": { "_id": "4" }}
{ "id":4, "field": "foo bar foo bar foo bar test test2 test 3 foo test1 test2 test3" }
{"index": { "_id": "5" }}
{ "id":5, "field": "foo2" }

It seems to work well with term searches. span_near doubles the scores, which I can understand (both terms match, after all), and I can compensate for it by setting boost: 0.5.
There is a big problem with wildcards, though. The term query below returns a max score of 4 (doc id 4, which is correct):

GET index/_search?explain=true
{
    "query": {
        "bool": {
            "must": [
                {
                    "term": {
                        "field": {
                            "value": "foo"
                        }
                    }
                }
            ]
        }
    }
}

A wildcard query, however, always returns a score of 1 regardless of the number of matches (the same doc id=4 returns a score of 1 now). It looks like either a bug or a feature that is not implemented:

GET index/_search?explain=true
{
    "query": {
        "bool": {
            "must": [
                {
                    "wildcard": {
                        "field": "foo"
                    }
                }
            ]
        }
    }
}

Using span_multi at least enables scoring based on the number of hits, but it doubles the score for "foo*", even though one of the expanded terms has no hits in the record:

GET index/_search?explain=true
{
    "query": {
        "bool": {
            "must": [
                {
                    "span_multi": {
                        "match": {
                            "wildcard": {
                                "field": "foo*"
                            }
                        }
                    }
                }
            ]
        }
    }
}

The explanation shows doc.freq=4 for both foo and foo2, even though foo2 is not part of the document.
The same happens with span_or (which is what span_multi converts the wildcard into anyway):

GET index/_search?explain=true
{
    "query": {
        "bool": {
            "must": [
                {
                    "span_or": {
                        "clauses": [
                            {
                                "span_term": {
                                    "field": "foo"
                                }
                            },
                            {
                                "span_term": {
                                    "field": "foo2"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
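
The doubling I see with span_multi and span_or can be reproduced with a toy model. This is only my reading of the explain output, not confirmed Elasticsearch behaviour: it assumes the similarity script `query.boost * doc.freq` is evaluated once per expanded term, each time with the combined span frequency, and that the results are summed. The function names are hypothetical.

```python
from typing import Sequence


def per_term_score(span_freq: int, expanded_terms: Sequence[str], boost: float = 1.0) -> float:
    """Assumed current behaviour: the script runs once per expanded term,
    every run sees the same combined span frequency, and the results are summed."""
    return sum(boost * span_freq for _ in expanded_terms)


def match_count_score(span_freq: int, boost: float = 1.0) -> float:
    """Desired behaviour: the combined span frequency is scored only once."""
    return boost * span_freq


# Doc id 4 contains "foo" four times and "foo2" zero times.
doubled = per_term_score(4, ["foo", "foo2"])
wanted = match_count_score(4)
print(doubled, wanted)
```

Under that assumption, the score grows with the number of terms the wildcard expands to, not with the number of actual matches, which is the opposite of what I need.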

Is this a bug? Is there any workaround, or a way to remove the duplicates for span_or or wildcard queries?
