I'm trying to use dot_product similarity in Elasticsearch, but after creating the index, the documents fail to index due to a document_parsing_exception.
The error message being returned is: failed to parse: The [dot_product] similarity can only be used with unit-length vectors.
I'm not sure why this is happening. I've double-checked the dimensions and tried casting each vector component to float16 and float32, but it still won't work. For reference, this is what one of my vector components looks like: -0.29541016
The model I'm using to create the embeddings is msmarco-distilbert-base-dot-prod-v3, and I'm running my code in Python.
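In case it helps, this is roughly (simplified) how I generate the embeddings:

```python
from sentence_transformers import SentenceTransformer

# Load the dot-product-tuned MS MARCO model
model = SentenceTransformer("msmarco-distilbert-base-dot-prod-v3")

# Encode a batch of documents; each embedding is a 768-dim float array
embeddings = model.encode(["first document text", "second document text"])
print(embeddings.shape)  # (2, 768)
```

Each embedding then goes into a dense_vector field via the Python client.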
To use dot_product similarity, all your vectors must be of unit length (see the dense_vector parameters):
> When element_type is float, all vectors must be unit length, including both document and query vectors.
You're probably missing normalization for your vectors.
If you don't have normalized vectors, you can use cosine similarity instead, which will be less efficient.
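For reference, a minimal sketch of what the mapping could look like with cosine, assuming Elasticsearch 8.x, the 8.x Python client, and a hypothetical index my-index with 768-dimension vectors:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Hypothetical index: a 768-dim dense_vector field using cosine,
# which does not require unit-length vectors
es.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "my_vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
            }
        }
    },
)
```

Switching "cosine" to "dot_product" here is what triggers the unit-length requirement you're hitting.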
I'm a bit new to embeddings, so I just want to ask some follow-up questions if that's okay.
What exactly does "of unit length" mean?
I've also read the normalisation article, which was very useful. The idea I'm getting is that I have to normalise every single component inside the vector array using:
```python
import numpy as np

# vector: the embedding array returned by the model
magnitude = np.linalg.norm(vector)  # magnitude (Euclidean/L2 norm) of the whole vector
vector = vector / magnitude         # divide every component by the magnitude
```
"of unit length" means the magnitude for each vector is 1. That's the end result of normalization - every vector will have magnitude 1.
The normalization process you describe is exactly the right approach.
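If your embeddings come back as a NumPy array of shape (n_docs, dims), you can apply that process to all of them at once; a minimal sketch, assuming a hypothetical embeddings array of that shape:

```python
import numpy as np

# embeddings: hypothetical (n_docs, dims) array from your model
embeddings = np.random.rand(5, 768).astype(np.float32)

# Divide each row by its own magnitude (L2 norm)
magnitudes = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / magnitudes

# Every row now has magnitude ~1, i.e. unit length
print(np.linalg.norm(unit_embeddings, axis=1))  # [1. 1. 1. 1. 1.]
```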
Keep in mind that you can use cosine similarity without needing to normalize your vectors. That might be good as a first approach, since you'll skip the normalization step.
It is important to note, though, that if you are able to normalize your vectors during preprocessing, that is preferable: as per the documentation, cosine similarity is slower than dot_product, because cosine similarity normalizes the vectors on the fly.
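The underlying identity: the cosine similarity of two vectors equals the dot product of those vectors after both are normalized, so normalizing up front lets the cheaper dot product do the same job at query time. A quick check with hypothetical vectors:

```python
import numpy as np

a = np.random.rand(768)
b = np.random.rand(768)

# Cosine similarity: dot product divided by the product of the magnitudes
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the unit-length versions of the same vectors
dot_of_units = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(np.isclose(cosine, dot_of_units))  # True
```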
Does it matter which similarity method you use? I've read that some models are specifically trained with cosine similarity and others with dot-product. Is it therefore safe to assume that one should use the matching similarity method for optimal results?
On top of this, I've also read this in a blog:
> Also, models tuned for cosine-similarity will prefer the retrieval of short documents, while models tuned for dot-product will prefer the retrieval of longer documents.
Is this something that holds true across all or most models?