#custom_similarity #_ignored #search_problem #bm25 #tfidf
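Hi guys,
First doubt: when I index documents that have one field with large text alongside other lightweight fields, the large field is reported under _ignored in the search hits (see abstract.keyword below). Yet searching on that field still works. How is that possible?
Example snippet: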
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "tfdif",
        "_id": "830",
        "_score": 1,
        "_ignored": [
          "abstract.keyword"
        ],
        "_source": {
          "title": "XXXXXXXXXXX Sentiment Classification.",
          "year": "2019",
          "authors": [
            " Singhal",
            "Chakeyrabarti",
            "Tanmmooy "
          ],
          "url": "https://030-16148-4_15",
          "abstract": "Now is the winter of our discontent
Made glorious summer by this sun of York;
And all the clouds that lour'd upon our house
In the deep bosom of the ocean buried.
Now are our brows bound with victorious wreaths;
Our bruised arms hung up for monuments;
Our stern alarums changed to merry meetings,
Our dreadful marches to delightful measures.
Grim-visaged war hath smooth'd his wrinkled front;
And now, instead of mounting barded steeds
To fright the souls of fearful adversaries,
He capers nimbly in a lady's chamber
To the lascivious pleasing of a lute.
But I, that am not shaped for sportive tricks,
Nor made to court an amorous looking-glass;
I, that am rudely stamp'd, and want love's majesty
To strut before a wanton ambling nymph;
I, that am curtail'd of this fair proportion,",
          "prof-names": [
            "Soumenly"
          ]
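One possible reading of the snippet above, offered as a sketch and not a confirmed diagnosis: the _ignored entry names abstract.keyword, not abstract itself. If the index relies on Elasticsearch's default dynamic mapping for strings (an assumption, since the mapping of this index is not shown), each string becomes a text field plus a keyword sub-field with ignore_above set to 256. A long value would then be skipped for the sub-field only, while the analyzed text field stays searchable. The Python below only imitates that rule locally; the helper name, the mapping, and the sample document are hypothetical.

```python
def ignored_subfields(properties, source):
    """Return 'field.subfield' names whose keyword sub-field would skip the value.

    Imitates the ignore_above rule: a keyword sub-field does not index
    strings longer than its ignore_above limit.
    """
    ignored = []
    for name, spec in properties.items():
        value = source.get(name)
        if not isinstance(value, str):
            continue
        for sub_name, sub_spec in spec.get("fields", {}).items():
            limit = sub_spec.get("ignore_above")
            if sub_spec.get("type") == "keyword" and limit is not None and len(value) > limit:
                ignored.append(f"{name}.{sub_name}")
    return ignored


# Assumed shape of a dynamically created mapping (not taken from the real index).
assumed_properties = {
    "title": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
    "abstract": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
}

# Hypothetical document: a short title and an abstract well over 256 characters.
sample_source = {
    "title": "Sentiment Classification.",
    "abstract": "Now is the winter of our discontent " * 10,
}

result = ignored_subfields(assumed_properties, sample_source)
print(result)
```

If this assumption holds, full-text queries on abstract would be unaffected, and only exact-match queries, sorting, or aggregations on abstract.keyword would miss the document.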
Second doubt: an index using the default BM25 similarity and an index using a custom scripted TF-IDF similarity return exactly identical scores.
Settings for BM25:
bm25_model = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "english_stop",
                        "lowercase"
                    ]
                }
            }
        },
        "index": {
            "similarity": {
                "bm25": {
                    "type": "BM25",
                    "b": 0.75,
                    "k1": 1.5
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "integer"},
            "prof-name": {"type": "text"},
            "abstract": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "my_stop_analyzer",
                "similarity": "bm25"
            }
        }
    }
}
Settings for TF-IDF:
tfdif_model = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_stop_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "english_stop",
                        "lowercase"
                    ]
                }
            }
        }
    },
    "index": {
        "similarity": {
            "scripted_tfidf": {
                "type": "scripted",
                "script": {
                    "source": "double tf = (doc.freq)/((doc.length)+1.0); double idf = Math.log((field.docCount + 1.0)/(term.docFreq+1.0)); return query.boost * tf * idf;"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "id": {"type": "integer"},
            "prof-name": {"type": "text"},
            "abstract": {
                "type": "text",
                "ignore_malformed": True,
                "analyzer": "standard",
                "search_analyzer": "my_stop_analyzer",
                "similarity": "scripted_tfidf"
            }
        }
    }
}
Searching on both indexes gives exactly the same scores:
[[4.412977],
 [4.2717557],
 [4.204721],
 [4.1907907],
 [4.177283],
 [4.1456566],
 [4.134912],
 [3.7425122],
 [3.7122254],
 [3.6996274]]
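As a sanity check on why identical scores look suspicious, here is a minimal local sketch comparing the two formulas on one term. The TF-IDF function mirrors the script source from the settings above. The BM25 function is the textbook form with k1 = 1.5 and b = 0.75; Lucene's actual implementation can differ by constant factors between versions, so treat it as an approximation. All statistics (term frequency, document length, document counts, average length) are made-up values, not taken from the real index.

```python
import math


def scripted_tfidf(freq, doc_length, doc_count, doc_freq, boost=1.0):
    """Mirror of the scripted similarity source in the TF-IDF settings."""
    tf = freq / (doc_length + 1.0)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0))
    return boost * tf * idf


def bm25(freq, doc_length, avg_doc_length, doc_count, doc_freq,
         k1=1.5, b=0.75, boost=1.0):
    """Textbook BM25 for a single term (an approximation of Lucene's scoring)."""
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_length / avg_doc_length)
    return boost * idf * freq * (k1 + 1.0) / (freq + length_norm)


# Hypothetical statistics for one matching term in one document.
stats = {"freq": 3, "doc_length": 150, "doc_count": 10000, "doc_freq": 120}

tfidf_score = scripted_tfidf(**stats)
bm25_score = bm25(avg_doc_length=180.0, **stats)

print(f"scripted tf-idf: {tfidf_score:.4f}")
print(f"textbook BM25:   {bm25_score:.4f}")
```

With these inputs the scripted TF-IDF score comes out well below 1 because of the division by document length, while the BM25 score is several units. Scores in the 3.7 to 4.4 range on both indexes would therefore be consistent with both indexes actually scoring with BM25, which may be worth verifying by fetching the live settings and mapping of each index.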
Thanks, guys. I hope to get some insight into what needs to be done.
Edit: The bm25 settings are not related to the first doubt's example snippet; the two are entirely separate.