I need to end up with a score where doc 1 and doc 2 both score 100% while document 3 scores 50%.
I'd like to score based on "Matching keywords" divided by length of keywords array.
BM25 with a k1 and b of 0 comes close, but it values longer arrays more than shorter arrays (in this case, document 2 scores higher than document 1).
Am I making things too difficult here?
Test index definition, relatively simple for now:
PUT keyword_test
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_search_analyzer",
        "norms": false,
        "similarity": "my_bm25_similarity",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "analysis": {
        "filter": {
          "dutch_stemmer": {
            "name": "dutch_kp",
            "type": "stemmer"
          },
          "my_synonyms": {
            "type": "synonym",
            "synonyms": [
              "synonym1, synonym2"
            ]
          },
          "dutch_stopwords": {
            "type": "stop",
            "stopwords": "dutch"
          }
        },
        "analyzer": {
          "my_search_analyzer": {
            "filter": [
              "lowercase",
              "asciifolding",
              "dutch_stemmer"
            ],
            "type": "custom",
            "tokenizer": "standard"
          },
          "my_analyzer": {
            "filter": [
              "lowercase",
              "asciifolding",
              "my_synonyms",
              "dutch_stemmer",
              "dutch_stopwords"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      },
      "number_of_replicas": "0",
      "similarity": {
        "my_bm25_similarity": {
          "type": "BM25",
          "b": "1",
          "k1": 1,
          "discount_overlaps": "true"
        },
        "my_sim1": {
          "type": "DFR",
          "basic_model": "in",
          "after_effect": "l",
          "normalization": "h1"
        },
        "my_sim2": {
          "type": "scripted",
          "script": {
            "source": "return doc.freq/doc.length;"
          }
        }
      }
    }
  }
}
I have a list of documents with keywords in an array:
POST _bulk
{ "index" : { "_index" : "keyword_test", "_id" : "1" } }
{ "keyword": "TheKeywordImLookingFor" }
{ "index" : { "_index" : "keyword_test", "_id" : "2" } }
{ "keyword": "TheKeywordImLookingFor; Someadjective TheKeywordImLookingFor"}
{ "index" : { "_index" : "keyword_test", "_id" : "3" } }
{ "keyword": "TheKeywordImLookingFor; NotTheKeywordImLookingFor"}
Ideally, docs 1 and 2 receive the same score,
since they both have 100% matching keywords (1 of 1 and 2 of 2).
However, due to the way the terms are stored, this actually becomes 1 of 1 and 2 of 3 ("Someadjective" becomes a separate term for doc 2).
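To make the term counts concrete, here is a rough Python sketch of what a term-frequency-over-length ratio sees for the three documents. It is only an approximation: `analyze` is a hypothetical stand-in for the standard tokenizer plus lowercasing, and it ignores the stemmer, synonym and stopword filters from my analyzer.

```python
import re

def analyze(text):
    # Hypothetical stand-in for the standard tokenizer + lowercase filter:
    # keep runs of letters/digits, drop punctuation such as ';'.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def term_ratio(text, query):
    # Matching terms divided by the total number of terms in the field,
    # i.e. roughly what freq / length works on.
    tokens = analyze(text)
    if not tokens:
        return 0.0
    wanted = query.lower()
    return sum(1 for token in tokens if token == wanted) / len(tokens)

docs = {
    "1": "TheKeywordImLookingFor",
    "2": "TheKeywordImLookingFor; Someadjective TheKeywordImLookingFor",
    "3": "TheKeywordImLookingFor; NotTheKeywordImLookingFor",
}

for doc_id, text in docs.items():
    print(doc_id, analyze(text), term_ratio(text, "TheKeywordImLookingFor"))
```

Under these assumptions doc 2 ends up with three terms, two of which match, so it lands between doc 1 and doc 3 instead of tying with doc 1.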
Since the similarity module does not have access to the source doc, how can I work around this?
I basically want to calculate the following:
Number of matching terms, divided by the number of elements in the source array.
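As a plain illustration of the score I am after (outside Elasticsearch), here is a minimal Python sketch. It assumes the array elements are the `;`-separated parts of the field value and that an element "matches" when it contains the query as a whole word; `element_ratio` is a made-up helper name, not an Elasticsearch function.

```python
def element_ratio(source, query, sep=";"):
    # Split the raw field value into its array elements (assumed ';'-separated).
    elements = [part.strip() for part in source.split(sep) if part.strip()]
    if not elements:
        return 0.0
    wanted = query.lower()
    # An element counts as matching if the query appears as a whole word in it.
    matching = sum(1 for element in elements if wanted in element.lower().split())
    return matching / len(elements)

docs = {
    "1": "TheKeywordImLookingFor",
    "2": "TheKeywordImLookingFor; Someadjective TheKeywordImLookingFor",
    "3": "TheKeywordImLookingFor; NotTheKeywordImLookingFor",
}

for doc_id, source in docs.items():
    print(doc_id, element_ratio(source, "TheKeywordImLookingFor"))
```

With this definition docs 1 and 2 both come out at 1.0 and doc 3 at 0.5, which is the ranking I would like the index to produce.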