Custom function for Text Similarity Search

I'm using a dense_vector field and cosineSimilarity to score document similarity, following the good tutorial here

script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, doc['text_vector']) + 1.0",
            "params": {"query_vector": query_vector}
        }
    }
}

As of release 7.3, Elasticsearch natively provides cosineSimilarity for scripts. When working on STS (Semantic Textual Similarity), there are more metrics to choose from, among them:

  • Euclidean distance
  • Manhattan distance
  • Cosine distance (equivalent to the Euclidean distance between normalized vectors)
  • Hamming distance
  • Dot (Inner) Product distance
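
To make these metrics concrete, here is a minimal pure-Python sketch. The function names are illustrative only and do not come from any library. The final assertion checks the relationship behind the Cosine distance bullet: for unit-length vectors, the squared Euclidean distance equals 2 * (1 - cosine similarity), so both metrics rank documents in the same order.

```python
import math

def dot(a, b):
    # Dot (inner) product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # L2 distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # L1 distance.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between a and b (both must be non-zero).
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def hamming(a, b):
    # Number of positions at which the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

def normalize(v):
    # Scale v to unit length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

u = normalize([3.0, 4.0])
v = normalize([1.0, 0.0])

# For unit vectors: ||u - v||^2 == 2 * (1 - cos(u, v))
assert math.isclose(euclidean(u, v) ** 2, 2 * (1 - cosine_similarity(u, v)))
```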

So, how can I implement a custom Elasticsearch similarity function for the search query script, say Euclidean distance or dot product?

Thank you.

NOTE

  • My reference project was BertSearch, where the text embeddings were calculated with Google's BERT.
  • Keep in mind that a common transformation from a vector distance to a similarity is: similarity = 1 / (1 + distance).
  • Regarding BERT, a good similarity scoring approach is described in bert_score.

Hello!
From 7.3 we have the following vector functions available: cosineSimilarity and dotProduct.

In 7.4, two more functions were added: l1norm (Manhattan distance) and l2norm (Euclidean distance).

We are still investigating the need for bit vectors and hamming distance.

how to implement a custom Elasticsearch similarity function for the search query script, let's say Euclidean or dot product?

They are already implemented: dotProduct since 7.3 and l2norm (Euclidean distance) since 7.4.
There is no straightforward way to implement custom distance functions, as this would require developing a plugin. If you think a widely used function is not implemented yet, please open an issue in the Elasticsearch GitHub repository and we will discuss it.

We should keep in mind that for a given vectorial distance to get the similarity a common transformation is: similarity = 1 / (1 + distance).

In a script_score query, you can apply any transformation to the calculated distance, including the one you need.
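
As a rough sketch of that transformation, the helper below builds a script_score query that turns an l1norm or l2norm distance into a similarity with 1 / (1 + distance). The helper name distance_script_query and its parameters are hypothetical, and the field name text_vector is taken from the question. The script keeps the doc['field'] argument form used in the question; newer releases may expect a different form for the second argument, so please check the documentation for your version.

```python
def distance_script_query(query_vector, field="text_vector", function="l2norm"):
    """Build a script_score query scoring documents by 1 / (1 + distance).

    Hypothetical helper: `function` must be one of the distance functions
    mentioned in this thread (l1norm or l2norm, available from 7.4).
    """
    if function not in ("l1norm", "l2norm"):
        raise ValueError("unsupported distance function: " + function)

    # Smaller distance -> score closer to 1; the score is never negative.
    source = "1 / (1 + {}(params.query_vector, doc['{}']))".format(function, field)

    return {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": source,
                "params": {"query_vector": query_vector},
            },
        }
    }


script_query = distance_script_query([0.1, 0.2, 0.3])
```

The resulting script_query can be used in the search request in the same place as the cosineSimilarity version from the question. Passing function="l1norm" switches to Manhattan distance.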
