I am working on a use case with documents in the following format:
[{'Fruits': ['Mango', 'Apple']},
{'Fruits': ['Banana', 'Guava', 'Mango']},
{'Fruits': ['Grapes', 'Apple']}]
I am storing the data in an Elasticsearch cluster in the format below:
[{"Fruits": [{"value": dense_vector('Mango'), "text": "Mango"}, {"value": dense_vector('Apple'), "text": "Apple"}]},
{"Fruits": [{"value": dense_vector('Banana'), "text": "Banana"}, {"value": dense_vector('Guava'), "text": "Guava"},
{"value": dense_vector('Mango'), "text": "Mango"}]}, ...]
I am storing dense vectors (BERT embeddings) so that synonyms can be handled by computing cosine similarity. My Elasticsearch query looks like this:
{
  "nested": {
    "path": "Fruits",
    "score_mode": "max",
    "query": {
      "function_score": {
        "script_score": {
          "script": {
            "source": "1+cosineSimilarity(params.query_vector0,'Fruits.value')+cosineSimilarity(params.query_vector1,'Fruits.value')",
            "params": {"query_vector0": query_vector, "query_vector1": query_vector1}
          }
        }
      }
    }
  }
}
where query_vector and query_vector1 are set as follows:
query_vector = bert_embedding.encode(["Grape"])[0]
query_vector1 = bert_embedding.encode(["Apples"])[0]
When I pass the user queries "Grape" and "Apples", I get similar scores for these two documents:
[{'Fruits': ['Mango', 'Apple']},
{'Fruits': ['Grapes', 'Apple']}]
I would expect a higher score for {'Fruits': ['Grapes', 'Apple']} than for the first document. Also, when I change the value of "score_mode" between avg, sum, and max, the results change. Which score_mode should I settle on, given my document format?
I am not able to figure out why the cosine scores are similar for both documents. Can someone please help with that?
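
To make the question concrete, here is a minimal sketch of how I understand the scoring to work. It assumes the script is evaluated once per nested "Fruits" object and that score_mode then combines those per-object scores into the document score. The 3-dimensional vectors and the helper names (cosine, nested_score) are made up for illustration; they are not real BERT embeddings and nothing here calls Elasticsearch.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nested_score(object_vectors, q0, q1, score_mode="max"):
    """Emulate the nested query: score each nested object, then combine.

    Assumption: each nested object is scored with the same formula as the
    script above, 1 + cos(q0, v) + cos(q1, v).
    """
    per_object = [1 + cosine(q0, v) + cosine(q1, v) for v in object_vectors]
    if score_mode == "max":
        return max(per_object)
    if score_mode == "sum":
        return sum(per_object)
    if score_mode == "avg":
        return sum(per_object) / len(per_object)
    raise ValueError(f"unsupported score_mode: {score_mode}")


# Toy stand-ins for the embeddings (hypothetical values, one axis per fruit).
GRAPE = [1.0, 0.0, 0.0]
APPLE = [0.0, 1.0, 0.0]
MANGO = [0.0, 0.0, 1.0]

doc_mango_apple = [MANGO, APPLE]   # {'Fruits': ['Mango', 'Apple']}
doc_grapes_apple = [GRAPE, APPLE]  # {'Fruits': ['Grapes', 'Apple']}

for mode in ("max", "sum", "avg"):
    first = nested_score(doc_mango_apple, GRAPE, APPLE, mode)
    second = nested_score(doc_grapes_apple, GRAPE, APPLE, mode)
    print(mode, first, second)
```

With these toy vectors, "max" gives both documents the same score, because only the single best-matching nested object counts and both documents contain Apple, while "sum" and "avg" rank the second document higher. If my assumption about per-object evaluation is right, that may be what I am seeing with the real embeddings.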