TF-IDF custom similarity and BM25 give the same scores and identical results, along with a minor problem

dayzero · September 20, 2022, 8:47pm

#custom_similarity #_ignored #search_problem #bm25 #tfidf
Hi guys,
First question: when I index a document that has one field with large text alongside other lightweight fields, the large field is reported as ignored (see _ignored in the response below). Yet searching on that field still works. How?
Example snippet:

{
 "took": 2,
 "timed_out": false,
 "_shards": {
   "total": 1,
   "successful": 1,
   "skipped": 0,
   "failed": 0
 },
 "hits": {
   "total": {
     "value": 10000,
     "relation": "gte"
   },
   "max_score": 1,
   "hits": [
     {
       "_index": "tfdif",
       "_id": "830",
       "_score": 1,
       "_ignored": [
         "abstract.keyword"
       ],
       "_source": {
         "title": "XXXXXXXXXXX Sentiment Classification.",
         "year": "2019",
         "authors": [
           " Singhal",
           "Chakeyrabarti",
           "Tanmmooy "
         ],
         "url": "https://030-16148-4_15",
         "abstract": "Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,",
         "prof-names": [
           "Soumenly"
         ]
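
A note on the _ignored entry above: the value abstract.keyword suggests that abstract was mapped dynamically, in which case Elasticsearch adds a keyword sub-field with ignore_above: 256 next to the text field. Values longer than that are skipped for the keyword sub-field only, while the analyzed text field is still indexed, which would explain why search keeps working. The sketch below mimics that rule in plain Python. The helper name and sample strings are hypothetical, and the 256 limit is the documented dynamic-mapping default, not something read from this index.

```python
# Default dynamic mapping for a JSON string: a text field plus a keyword
# sub-field that skips long values (assumed here; check GET <index>/_mapping).
DYNAMIC_STRING_MAPPING = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}


def ignored_subfields(field_name, value, mapping):
    """Return sub-field paths that would be skipped for this value (hypothetical helper)."""
    ignored = []
    for sub_name, sub in mapping.get("fields", {}).items():
        limit = sub.get("ignore_above")
        if sub.get("type") == "keyword" and limit is not None and len(value) > limit:
            ignored.append(f"{field_name}.{sub_name}")
    return ignored


short_abstract = "A short abstract."
long_abstract = "Now is the winter of our discontent " * 20  # well over 256 characters

print(ignored_subfields("abstract", short_abstract, DYNAMIC_STRING_MAPPING))
print(ignored_subfields("abstract", long_abstract, DYNAMIC_STRING_MAPPING))
```

If the stored mapping for abstract shows such a keyword sub-field, the explicit mapping was probably never applied to this index.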

Second question: the default BM25 similarity and my custom TF-IDF similarity give exactly identical results.
Settings for BM25:

bm25_model = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "english_stop",
                        "lowercase"
                    ]
                }
            }
        },
        "index": {
            "similarity": {
                "bm25": {
                    "type": "BM25",
                    "b": 0.75,
                    "k1": 1.5
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "integer"},
            "prof-name": {"type": "text"},
            "abstract": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "my_stop_analyzer",
                "similarity": "bm25"
            }
        }
    }
}

Settings for TF-IDF:

tfdif_model = {
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_stop_analyzer":{ 
                   "type":"custom",
                   "tokenizer":"standard",
                   "filter":[
                       "english_stop",
                       "lowercase"
                   ]
                }
            }
        }
    },
    "index" : {
        "similarity" : {
            "scripted_tfidf": {
                "type": "scripted",
                "script": {
                  "source": "double tf = (doc.freq)/((doc.length)+1.0); double idf = Math.log((field.docCount + 1.0)/(term.docFreq+1.0));  return query.boost * tf * idf;"
                }
            }
        }
    },
    "mappings" : {
        "properties" : {
            "id" : {"type" : "integer"},
            "prof-name": {"type": "text"},
            "abstract" : {
                "type": "text",
                "ignore_malformed": True,
                "analyzer": "standard",
                "search_analyzer": "my_stop_analyzer",
                "similarity": "scripted_tfidf"
            }
        }
    }
}

Searching on both indexes gives exactly the same scores:

[[4.412977],
 [4.2717557],
 [4.204721],
 [4.1907907],
 [4.177283],
 [4.1456566],
 [4.134912],
 [3.7425122],
 [3.7122254],
 [3.6996274]]
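
For reference, the two formulas should not produce the same numbers for the same term statistics, so identical scores point to the same similarity being used on both indexes. The sketch below evaluates the scripted formula from the settings above next to a BM25 term score in the form Lucene 8 and later use as far as I know (no (k1 + 1) factor in the numerator). The statistics are made-up values, not taken from this index.

```python
import math


def scripted_tfidf(freq, doc_length, doc_count, doc_freq, boost=1.0):
    # Same arithmetic as the Painless script in the tfidf settings.
    tf = freq / (doc_length + 1.0)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0))
    return boost * tf * idf


def bm25(freq, doc_length, avg_doc_length, doc_count, doc_freq,
         k1=1.5, b=0.75, boost=1.0):
    # Assumed form of Lucene's BM25 term score; treat it as an approximation.
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1.0 - b + b * doc_length / avg_doc_length)
    return boost * idf * freq / (freq + norm)


# Hypothetical statistics for one term in one document.
stats = dict(freq=3, doc_length=120, doc_count=10000, doc_freq=50)
print(scripted_tfidf(**stats))
print(bm25(avg_doc_length=150, **stats))
```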

Thanks, guys. I hope to get some insight into what needs to be done.
Edit: The BM25 settings are not related to the example snippet from the first question; that is an entirely separate issue.

dayzero · September 21, 2022, 9:29am

I set explain=True while searching and found that the scripted TF-IDF similarity is not in use; Elasticsearch is using the default BM25 instead. How can I change the default similarity?
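
One way to run this check from Python is to walk the _explanation tree returned with explain=True and look at the description strings. The sketch below assumes the usual explain shape (value, description, details). The sample tree is a hand-written illustration, not real output from this index, and the exact description text varies by version.

```python
def explanation_descriptions(explanation):
    """Yield every description string in an explain tree, depth first."""
    yield explanation.get("description", "")
    for child in explanation.get("details", []):
        yield from explanation_descriptions(child)


def mentions_scripted_similarity(explanation, marker="script"):
    # The marker text is an assumption: inspect a real explain output and adjust it.
    return any(marker in d.lower() for d in explanation_descriptions(explanation))


# Hand-written sample resembling a BM25 explanation (illustrative only).
sample = {
    "value": 4.41,
    "description": "weight(abstract:sentiment in 830) [PerFieldSimilarity], result of:",
    "details": [
        {
            "value": 4.41,
            "description": "score(freq=3.0), computed as boost * idf * tf from:",
            "details": [
                {"value": 5.28, "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details": []},
                {"value": 0.70, "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details": []},
            ],
        }
    ],
}

for line in explanation_descriptions(sample):
    print(line)
print(mentions_scripted_similarity(sample))
```

An explanation that shows k1, b and avgdl terms, as in the sample, is a sign that BM25 is doing the scoring.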

dayzero · September 25, 2022, 11:08am

Solution update: Elasticsearch was not using the source of the custom script when the index was created with the Elasticsearch Python client. When I created the index with the same settings directly in the Kibana console, it worked as expected, and the _ignored field issue was also resolved. If someone can explain why this happened, it would be great to learn.
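
A possible explanation, offered as a guess and not a confirmed diagnosis: in the TF-IDF settings posted above, the index block with the similarity sits next to settings instead of inside it, english_stop is referenced without being defined as a filter, and ignore_malformed is, as far as I know, not a parameter of text fields. If the create call fails for any of these reasons and the error is swallowed, the first indexed document auto-creates the index with dynamic mappings, which would give both default BM25 scoring and the abstract.keyword sub-field. The sketch below shows one way to lay out the body and read back what the cluster stored. The es client object, the index name and the helper name are assumptions, and the keyword-argument call style is that of recent elasticsearch-py releases.

```python
TFIDF_SCRIPT = (
    "double tf = doc.freq / (doc.length + 1.0); "
    "double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)); "
    "return query.boost * tf * idf;"
)

tfidf_body = {
    "settings": {
        "analysis": {
            # english_stop has to be defined before an analyzer can reference it.
            "filter": {"english_stop": {"type": "stop", "stopwords": "_english_"}},
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # lowercase first so capitalised stop words are caught
                    # (a change from the order in the original post).
                    "filter": ["lowercase", "english_stop"],
                }
            },
        },
        # The similarity lives under settings.index, not at the top level.
        "index": {
            "similarity": {
                "scripted_tfidf": {"type": "scripted", "script": {"source": TFIDF_SCRIPT}}
            }
        },
    },
    "mappings": {
        "properties": {
            "id": {"type": "integer"},
            "prof-name": {"type": "text"},
            "abstract": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "my_stop_analyzer",
                "similarity": "scripted_tfidf",
            },
        }
    },
}


def create_and_verify(es, index_name, body):
    """Create the index without suppressing errors, then read back the stored mapping."""
    es.indices.create(index=index_name, settings=body["settings"], mappings=body["mappings"])
    stored = es.indices.get_mapping(index=index_name)[index_name]["mappings"]
    return stored["properties"]["abstract"].get("similarity")
```

If the mapping was applied, create_and_verify(es, "tfidf", tfidf_body) should return scripted_tfidf; a missing value means the field fell back to the default. To change the similarity for every field instead of one, the documented route is to register a similarity named default under settings.index.similarity.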

system · October 23, 2022, 11:09am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
