TF-IDF custom similarity and BM25 give identical scores and results, plus a minor problem with an ignored field

#custom_similarity #_ignored #search_problem #bm25 #tfidf
Hi guys,
First question: when I index a document that has one field with large text alongside other lightweight fields, the large field is reported as ignored (see _ignored in the hit below), yet searching on that field still works. How is that possible?
Example snippet:

{
 "took": 2,
 "timed_out": false,
 "_shards": {
   "total": 1,
   "successful": 1,
   "skipped": 0,
   "failed": 0
 },
 "hits": {
   "total": {
     "value": 10000,
     "relation": "gte"
   },
   "max_score": 1,
   "hits": [
     {
       "_index": "tfdif",
       "_id": "830",
       "_score": 1,
       "_ignored": [
         "abstract.keyword"
       ],
       "_source": {
         "title": "XXXXXXXXXXX Sentiment Classification.",
         "year": "2019",
         "authors": [
           " Singhal",
           "Chakeyrabarti",
           "Tanmmooy "
         ],
         "url": "https://030-16148-4_15",
         "abstract": "Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,",
         "prof-names": [
           "Soumenly"
         ]
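
One possible explanation, offered as an assumption because the mapping of this index is not shown: if the abstract field was mapped dynamically, Elasticsearch's default string mapping creates a text field plus a keyword sub-field with ignore_above set to 256. A long value would then be skipped only for abstract.keyword and reported in _ignored, while the analyzed abstract text field stays searchable. The sketch below imitates that rule in plain Python. The function name and the sample document are hypothetical, and it does not talk to a cluster.

```python
# Assumption: default dynamic string mapping, i.e. a text field with a
# "keyword" sub-field that has ignore_above = 256.
DEFAULT_IGNORE_ABOVE = 256


def ignored_subfields(doc, ignore_above=DEFAULT_IGNORE_ABOVE):
    """Return the keyword sub-fields that would be skipped for this document."""
    ignored = []
    for name, value in doc.items():
        # A field may hold a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        if any(isinstance(v, str) and len(v) > ignore_above for v in values):
            ignored.append(f"{name}.keyword")
    return ignored


# Hypothetical document: a short title and a 720-character abstract.
doc = {
    "title": "Sentiment Classification.",
    "abstract": "Now is the winter of our discontent " * 20,
}
print(ignored_subfields(doc))  # ['abstract.keyword']
```

Under this reading, only the exact-match sub-field is skipped, which would be consistent with the hit above listing abstract.keyword rather than abstract.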

Second question: searching with the default BM25 similarity and with my custom scripted TF-IDF similarity gives exactly identical results.
Settings for BM25:

bm25_model = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "english_stop",
                        "lowercase"
                    ]
                }
            }
        },
        "index": {
            "similarity": {
                "bm25": {
                    "type": "BM25",
                    "b": 0.75,
                    "k1": 1.5
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "integer"},
            "prof-name": {"type": "text"},
            "abstract": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "my_stop_analyzer",
                "similarity": "bm25"
            }
        }
    }
}

Settings for TF-IDF:

tfdif_model = {
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_stop_analyzer":{ 
                   "type":"custom",
                   "tokenizer":"standard",
                   "filter":[
                       "english_stop",
                       "lowercase"
                   ]
                }
            }
        }
    },
    "index" : {
        "similarity" : {
            "scripted_tfidf": {
                "type": "scripted",
                "script": {
                  "source": "double tf = (doc.freq)/((doc.length)+1.0); double idf = Math.log((field.docCount + 1.0)/(term.docFreq+1.0));  return query.boost * tf * idf;"
                }
            }
        }
    },
    "mappings" : {
        "properties" : {
            "id" : {"type" : "integer"},
            "prof-name": {"type": "text"},
            "abstract" : {
                "type": "text",
                "ignore_malformed": True,
                "analyzer": "standard",
                "search_analyzer": "my_stop_analyzer",
                "similarity": "scripted_tfidf"
            }
        }
    }
}
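
One difference between the two bodies as posted: in the BM25 body the index.similarity block sits inside "settings", while in the TF-IDF body "index" is a top-level key next to "settings" and "mappings". Whether that is what caused the behaviour described here is not confirmed in this thread. A minimal sketch of a local check, run before sending a body to the cluster, could look like this. The helper name is hypothetical and it only understands the nested settings.index.similarity form.

```python
# Similarity names that can be used in a mapping without defining them.
BUILTIN_SIMILARITIES = {"BM25", "boolean"}


def undefined_similarities(body):
    """Map each field to a similarity name that the body never defines.

    Only settings -> index -> similarity is inspected (an assumption of
    this sketch); flat keys such as "index.similarity.x" are not handled.
    """
    settings = body.get("settings", {})
    defined = set(settings.get("index", {}).get("similarity", {}))
    missing = {}
    properties = body.get("mappings", {}).get("properties", {})
    for field, spec in properties.items():
        name = spec.get("similarity")
        if name and name not in defined and name not in BUILTIN_SIMILARITIES:
            missing[field] = name
    return missing


# Trimmed-down copies of the two layouts, for illustration only.
mapping = {"properties": {"abstract": {"type": "text",
                                       "similarity": "scripted_tfidf"}}}
similarity = {"scripted_tfidf": {"type": "scripted"}}

as_posted = {"settings": {"analysis": {}},
             "index": {"similarity": similarity},
             "mappings": mapping}
nested = {"settings": {"analysis": {}, "index": {"similarity": similarity}},
          "mappings": mapping}

print(undefined_similarities(as_posted))  # {'abstract': 'scripted_tfidf'}
print(undefined_similarities(nested))     # {}
```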

Searching on both indexes returns exactly the same scores:

[[4.412977],
 [4.2717557],
 [4.204721],
 [4.1907907],
 [4.177283],
 [4.1456566],
 [4.134912],
 [3.7425122],
 [3.7122254],
 [3.6996274]]
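
Scores that match to every digit suggest that both indexes are being scored by the same similarity, because the scripted formula and BM25 are different functions of the same statistics. The sketch below evaluates both for a single term. The scripted function mirrors the script source from the TF-IDF settings. The BM25 function is the textbook form, which may differ by a constant factor from what a given Elasticsearch version reports. All statistics in the example are made up for illustration.

```python
import math


def scripted_tfidf(freq, doc_len, doc_count, doc_freq, boost=1.0):
    """Single-term score following the script source in the settings."""
    tf = freq / (doc_len + 1.0)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0))
    return boost * tf * idf


def bm25(freq, doc_len, avg_len, doc_count, doc_freq,
         k1=1.5, b=0.75, boost=1.0):
    """Textbook single-term BM25 with the k1 and b values from the settings."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return boost * idf * freq * (k1 + 1) / (freq + norm)


# Hypothetical statistics: term occurs 3 times in a 120-token document,
# average length 100 tokens, 10000 documents, 50 of them contain the term.
tfidf_score = scripted_tfidf(freq=3, doc_len=120, doc_count=10000, doc_freq=50)
bm25_score = bm25(freq=3, doc_len=120, avg_len=100,
                  doc_count=10000, doc_freq=50)
print(tfidf_score, bm25_score)
```

With these inputs the two values are far apart, since the scripted term frequency is divided by the document length while BM25 saturates it.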

Thanks, guys. I hope to get some insight into what needs to be done.
Edit: The BM25 settings are not related to the example snippet in the first question. That is an entirely separate issue.

I set explain=True while searching and found that the scripted TF-IDF similarity is not being used. BM25 is used by default instead. How can I change the default similarity?

Solution update: Elasticsearch was not using the source of the custom script when the index was created with the Elasticsearch Python client. When I created the index with the same settings directly in the Kibana console, it worked as expected, and the _ignored problem was resolved as well. If someone can explain why this happened, it would be great to learn.
