Similarity scoring based on array length and matching terms

KoPee · October 23, 2020, 10:29am

I need to end up with a score where doc 1 and doc 2 both score 100% while document 3 scores 50%.
I'd like to score based on "Matching keywords" divided by length of keywords array.

BM25 with k1 and b set to 0 comes close, but it scores longer arrays higher than shorter ones (in this case, document 2 scores higher than document 1).

Am I making things too difficult here?

Test index definition, relatively simple for now.

PUT keyword_test
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_search_analyzer",
        "norms": false,
        "similarity": "my_bm25_similarity",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "filter": {
          "dutch_stemmer": {
            "name": "dutch_kp",
            "type": "stemmer"
          },
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "synonym1, synonym2"
            ]
          },
          "dutch_stopwords": {
            "type": "stop",
            "stopwords": "dutch"
          }
        },
        "analyzer": {
          "my_search_analyzer": {
            "filter": [
              "lowercase",
              "asciifolding",
              "dutch_stemmer"
            ],
            "type": "custom",
            "tokenizer": "standard"
          },
          "my_analyzer": {
            "filter": [
              "lowercase",
              "asciifolding",
              "my_synonyms",
              "dutch_stemmer",
              "dutch_stopwords"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      },
      "number_of_replicas": "0",
      "similarity": {
        "my_bm25_similarity": {
          "type": "BM25",
          "b": "1",
          "k1": 1,
          "discount_overlaps": "true"
        },
        "my_sim1": {
          "type": "DFR",
          "basic_model": "in",
          "after_effect": "l",
          "normalization": "h1"
        },
        "my_sim2": {
          "type": "scripted",
          "script": {
            "source": "return doc.freq/doc.length;"
          }
        }
      }
    }
  }
}

I have a list of documents with keywords in an array:

POST _bulk
{ "index" : { "_index" : "keyword_test", "_id" : "1" } }
{ "keyword": "TheKeywordImLookingFor" }
{ "index" : { "_index" : "keyword_test", "_id" : "2" } }
{ "keyword": "TheKeywordImLookingFor; Someadjective TheKeywordImLookingFor"}
{ "index" : { "_index" : "keyword_test", "_id" : "3" } }
{ "keyword": "TheKeywordImLookingFor; NotTheKeywordImLookingFor"}

Ideally, docs 1 and 2 receive the same score, since they both have 100% matching keywords (1 of 1 and 2 of 2).

However, due to the way the terms are stored, this actually becomes 1 of 1 and 2 of 3 (Someadjective gets split into a separate term for doc 2).
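To make the 1 of 1 versus 2 of 3 counts concrete, here is a rough Python sketch of what a term-level ratio such as doc.freq/doc.length sees. The `tokenize` helper is only a stand-in I made up for the standard tokenizer plus lowercase filter; it ignores stemming, synonyms and stop words, so treat it as an illustration rather than what Elasticsearch actually runs.

```python
import re

# The three sample documents from the bulk request.
DOCS = {
    "1": "TheKeywordImLookingFor",
    "2": "TheKeywordImLookingFor; Someadjective TheKeywordImLookingFor",
    "3": "TheKeywordImLookingFor; NotTheKeywordImLookingFor",
}


def tokenize(text):
    # Assumption: crude approximation of standard tokenizer + lowercase.
    return re.findall(r"\w+", text.lower())


def term_ratio(text, query):
    # Returns (occurrences of the query term, total number of terms).
    tokens = tokenize(text)
    q = query.lower()
    freq = sum(1 for t in tokens if t == q)
    return freq, len(tokens)


for doc_id, text in DOCS.items():
    freq, length = term_ratio(text, "TheKeywordImLookingFor")
    print(f"doc {doc_id}: {freq} of {length}")
```

Under these assumptions doc 2 ends up with three terms, only two of which match, which is why it falls below doc 1.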

Since the similarity module does not have access to the source doc, how can I work around this?
I basically want to calculate the following:

Number of matching terms divided by the number of elements in the source array.
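For reference, this is the calculation I am after, written as a small Python sketch outside Elasticsearch. It assumes that ";" separates the array elements (as in the sample documents) and that an element counts as matching when it contains the query as a whole token. The `element_score` function is a hypothetical helper for illustration, not something the similarity module offers.

```python
import re


def element_score(raw, query):
    """Fraction of ';'-separated elements containing the query as a whole token."""
    # Assumption: ';' is the element separator in the stored string.
    elements = [e.strip() for e in raw.split(";") if e.strip()]
    if not elements:
        return 0.0
    q = query.lower()
    matching = sum(
        1 for e in elements if q in re.findall(r"\w+", e.lower())
    )
    return matching / len(elements)


query = "TheKeywordImLookingFor"
print(element_score("TheKeywordImLookingFor", query))
print(element_score("TheKeywordImLookingFor; Someadjective TheKeywordImLookingFor", query))
print(element_score("TheKeywordImLookingFor; NotTheKeywordImLookingFor", query))
```

With these assumptions docs 1 and 2 both come out at 1.0 and doc 3 at 0.5, which is the ranking described at the top of the post.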

