Calculate cosine similarity for nested objects in Elasticsearch using Python

I am working on a use case that has documents in the following format:

[{'Fruits': ['Mango', 'Apple']},
 {'Fruits': ['Banana', 'Guava', 'Mango']},
 {'Fruits': ['Grapes', 'Apple']}]

I am storing the data in an Elasticsearch cluster in the format below:

[{"Fruits": [{"value": dense_vector('Mango'), "text": "Mango"}, {"value": dense_vector('Apple'), "text": "Apple"}]},
 {"Fruits": [{"value": dense_vector('Banana'), "text": "Banana"}, {"value": dense_vector('Guava'), "text": "Guava"},
             {"value": dense_vector('Mango'), "text": "Mango"}]}, ...]
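
For reference, the storage format above could be produced roughly as follows. This is only a minimal sketch: the mapping, the 768 dimensions (a guess at the BERT embedding size), the keyword type for "text", and the helper name to_es_doc are assumptions for illustration, not something taken from my actual index.

```python
from typing import Callable, Dict, List, Sequence

# Assumed index mapping: "Fruits" is a nested field whose objects hold the
# embedding ("value") and the original string ("text").
# "dims" must match the embedding size; 768 is an assumption.
FRUITS_MAPPING = {
    "mappings": {
        "properties": {
            "Fruits": {
                "type": "nested",
                "properties": {
                    "value": {"type": "dense_vector", "dims": 768},
                    "text": {"type": "keyword"},
                },
            }
        }
    }
}


def to_es_doc(
    raw_doc: Dict[str, List[str]],
    embed: Callable[[str], Sequence[float]],
) -> Dict[str, List[dict]]:
    """Turn {'Fruits': ['Mango', 'Apple']} into the nested storage format.

    `embed` is a hypothetical callable that returns one vector per string,
    e.g. a thin wrapper around the BERT encoder.
    """
    return {
        "Fruits": [
            # Plain Python floats keep the document JSON-serializable.
            {"value": [float(x) for x in embed(name)], "text": name}
            for name in raw_doc["Fruits"]
        ]
    }
```

Each fruit becomes its own nested object, so any script running under the nested query sees one "Fruits.value" vector at a time.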

I am storing dense vectors (BERT embeddings) so that synonyms can be handled by calculating cosine similarity. My Elasticsearch query looks like this:

{
  "nested": {
    "path": "Fruits",
    "score_mode": "max",
    "query": {
      "function_score": {
        "script_score": {
          "script": {
            "source": "1+cosineSimilarity(params.query_vector0,'Fruits.value')+cosineSimilarity(params.query_vector1,'Fruits.value')",
            "params": {"query_vector0": query_vector, "query_vector1": query_vector1}
          }
        }
      }
    }
  }
}

where query_vector and query_vector1 have the following values:

# The user queries are passed as string literals
query_vector = bert_embedding.encode(["Grape"])[0]
query_vector1 = bert_embedding.encode(["Apples"])[0]
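
To make the setup easier to reproduce, the query body can be assembled in Python along these lines. This is a sketch under assumptions: build_nested_cosine_query is a hypothetical helper name, and the vectors are converted to plain lists of floats because NumPy arrays are not JSON-serializable as they are.

```python
from typing import Dict, Sequence


def build_nested_cosine_query(
    query_vectors: Sequence[Sequence[float]],
    score_mode: str = "max",
) -> Dict:
    """Build the nested script_score query for any number of query vectors."""
    if not query_vectors:
        raise ValueError("at least one query vector is required")

    # query_vector0, query_vector1, ... as plain lists of floats.
    params = {
        f"query_vector{i}": [float(x) for x in vec]
        for i, vec in enumerate(query_vectors)
    }
    # One cosineSimilarity term per query vector, summed in the script.
    terms = "+".join(
        f"cosineSimilarity(params.{name}, 'Fruits.value')" for name in params
    )
    return {
        "nested": {
            "path": "Fruits",
            "score_mode": score_mode,
            "query": {
                "function_score": {
                    "script_score": {
                        "script": {"source": "1+" + terms, "params": params}
                    }
                }
            },
        }
    }
```

Calling build_nested_cosine_query([query_vector, query_vector1]) yields the same structure as the query shown earlier, and the score_mode argument makes it easy to compare max, sum and avg.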

When I give the user queries "Grape" and "Apples", I get similar scores for these two documents:

[{'Fruits': ['Mango', 'Apple']},
 {'Fruits': ['Grapes', 'Apple']}]

However, I would expect a higher score for {'Fruits': ['Grapes', 'Apple']} than for the first document. Also, the results change when I set "score_mode" to avg, sum, or max. Which score_mode should I settle on for my document format?

I am not able to figure out why the cosine scores are similar for both documents. Can someone please help with that?
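
To show what I understand the query to compute, here is a small offline sketch of the scoring. It assumes the script is evaluated once per nested "Fruits" object and that score_mode then combines those per-object scores into the document score. The 2-dimensional vectors are made-up toy values, not real BERT embeddings, and nested_score is a hypothetical helper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nested_score(
    item_vectors: List[Sequence[float]],
    query_vectors: List[Sequence[float]],
    score_mode: str = "max",
) -> float:
    """Mimic the query: score every nested object, then aggregate."""
    # Script score per nested object: 1 + sum of cosine to every query vector.
    per_item = [
        1.0 + sum(cosine(q, v) for q in query_vectors) for v in item_vectors
    ]
    if score_mode == "max":
        return max(per_item)
    if score_mode == "sum":
        return sum(per_item)
    if score_mode == "avg":
        return sum(per_item) / len(per_item)
    raise ValueError(f"unsupported score_mode: {score_mode}")


# Toy vectors (assumptions for illustration only).
MANGO, APPLE, GRAPES = [0.6, 0.8], [1.0, 0.0], [0.0, 1.0]
Q_GRAPE, Q_APPLES = [0.0, 1.0], [1.0, 0.0]

for mode in ("max", "sum", "avg"):
    doc_mango_apple = nested_score([MANGO, APPLE], [Q_GRAPE, Q_APPLES], mode)
    doc_grapes_apple = nested_score([GRAPES, APPLE], [Q_GRAPE, Q_APPLES], mode)
    print(mode, doc_mango_apple, doc_grapes_apple)
```

If this model is right, every nested object is compared against both query vectors, so a single fruit that is moderately close to both queries can score as high as a fruit that matches one query exactly. With these toy values the ['Mango', 'Apple'] document is not ranked below ['Grapes', 'Apple'] under any of the three modes, which looks like the behaviour I am seeing.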
