I'm using a dense_vector field and cosineSimilarity to compute document similarity, following the good tutorial here:
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, doc['text_vector']) + 1.0",
            "params": {"query_vector": query_vector}
        }
    }
}
As of release 7.3, Elasticsearch natively provides the cosineSimilarity function for use in scripts. When working on STS (Semantic Textual Similarity), there are more metrics to choose from, among them:
- Euclidean distance
- Manhattan distance
- Cosine distance (equivalent to the Euclidean distance between normalized vectors, as far as ranking is concerned)
- Hamming distance
- Dot (Inner) Product distance
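
To make the metrics listed above concrete, here is a minimal pure-Python sketch of each one. It is only an illustration of the definitions, not Elasticsearch code; the function names (`euclidean`, `manhattan`, `cosine_distance`, `hamming`, `dot`, `normalize`) are my own and do not come from any library.

```python
import math


def dot(a, b):
    """Dot (inner) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def hamming(a, b):
    """Hamming distance: number of positions where the vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def normalize(a):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


if __name__ == "__main__":
    a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    u, v = normalize(a), normalize(b)
    # For unit vectors, squared Euclidean distance equals 2 * cosine distance,
    # which is why the two metrics produce the same ranking.
    print(euclidean(u, v) ** 2, 2 * cosine_distance(u, v))
```

The last two printed numbers agree up to floating-point error, which is the sense in which cosine distance and Euclidean distance on normalized vectors are equivalent.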
So, how can I implement a custom Elasticsearch similarity function for the search query script, say Euclidean distance or dot product?
Thank you.
NOTE
- My reference project is BertSearch, where the text embeddings are computed with Google's BERT.
- Keep in mind that a common way to turn a vector distance into a similarity is:
similarity = 1 / (1 + distance)
- Regarding BERT, a good similarity scoring approach is described in bert_score.
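
As a small illustration of the distance-to-similarity transformation in the note above, here is a hedged Python sketch. The helper name `distance_to_similarity` and the sample distances are hypothetical, chosen only for the example.

```python
def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1].

    Implements similarity = 1 / (1 + distance): a distance of 0 gives
    similarity 1, and larger distances approach 0.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    # Hypothetical distances from a query to three documents.
    distances = {"doc_a": 0.0, "doc_b": 1.0, "doc_c": 3.0}
    similarities = {k: distance_to_similarity(d) for k, d in distances.items()}
    # Sorting by descending similarity matches sorting by ascending distance.
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    print(ranked)
```

Because the transformation is strictly decreasing, it preserves the ranking: the closest document always gets the highest score, and the score stays positive, which matters for engines that reject negative scores.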