I'm trying to use dot_product similarity in Elasticsearch, but after creating the index, the documents fail to index due to a document_parsing_exception.
The error message being returned is: failed to parse: The [dot_product] similarity can only be used with unit-length vectors.
I'm not sure why this is happening. I've double-checked the dimensions and tried casting each vector component to float16 and float32, but it still won't work. For reference, this is what one of my vector components looks like: -0.29541016
The model I'm using to create the embeddings is msmarco-distilbert-base-dot-prod-v3, and I'm running my code in Python.
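In case it helps, this is roughly (simplified) how I generate the embeddings:

```python
from sentence_transformers import SentenceTransformer

# Load the dot-product-tuned MS MARCO model
model = SentenceTransformer("msmarco-distilbert-base-dot-prod-v3")

# Encode a batch of documents; each embedding is a 768-dim float array
embeddings = model.encode(["first document text", "second document text"])
print(embeddings.shape)  # (2, 768)
```

Each embedding then goes into a dense_vector field via the Python client.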
To use dot_product similarity, all your vectors must be of unit length (see the dense_vector parameters):
> When element_type is float, all vectors must be unit length, including both document and query vectors.
You're probably missing normalization for your vectors.
If you don't have normalized vectors, you can use cosine similarity instead, which will be less efficient.
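For reference, a minimal sketch of what the mapping could look like with cosine, assuming Elasticsearch 8.x, the 8.x Python client, and a hypothetical index my-index with 768-dimension vectors:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Hypothetical index: a 768-dim dense_vector field using cosine,
# which does not require unit-length vectors
es.indices.create(
    index="my-index",
    mappings={
        "properties": {
            "my_vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
            }
        }
    },
)
```

Switching "cosine" to "dot_product" here is what triggers the unit-length requirement you're hitting.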
I'm a bit new to embeddings, so I just want to ask some follow-up questions if that's okay.
What exactly does "of unit length" mean?
I've also read the normalisation article, which was very useful. The idea I'm getting is that I have to normalise every single component inside the vector array using:
```python
import numpy as np

# vector: the embedding array returned by the model
magnitude = np.linalg.norm(vector)  # magnitude (Euclidean/L2 norm) of the whole vector
vector = vector / magnitude         # divide every component by the magnitude
```
"of unit length" means the magnitude for each vector is 1. That's the end result of normalization - every vector will have magnitude 1.
The normalization process you describe is exactly the right approach.
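If your embeddings come back as a NumPy array of shape (n_docs, dims), you can apply that process to all of them at once; a minimal sketch, assuming a hypothetical embeddings array of that shape:

```python
import numpy as np

# embeddings: hypothetical (n_docs, dims) array from your model
embeddings = np.random.rand(5, 768).astype(np.float32)

# Divide each row by its own magnitude (L2 norm)
magnitudes = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / magnitudes

# Every row now has magnitude ~1, i.e. unit length
print(np.linalg.norm(unit_embeddings, axis=1))  # [1. 1. 1. 1. 1.]
```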
Keep in mind that you can use cosine similarity without needing to normalize your vectors. That might be good as a first approach, since you'll skip the normalization step.
It is important to note, though, that if you are able to normalize your vectors during preprocessing, that is preferable: as per the documentation, cosine similarity is slower than dot_product, because cosine similarity normalizes the vectors on the fly.
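The underlying identity: the cosine similarity of two vectors equals the dot product of those vectors after both are normalized, so normalizing up front lets the cheaper dot product do the same job at query time. A quick check with hypothetical vectors:

```python
import numpy as np

a = np.random.rand(768)
b = np.random.rand(768)

# Cosine similarity: dot product divided by the product of the magnitudes
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dot product of the unit-length versions of the same vectors
dot_of_units = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

print(np.isclose(cosine, dot_of_units))  # True
```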
Does it matter which similarity method you use? I've read that some models are specifically trained with cosine similarity and others with dot-product. Is it therefore safe to assume that one should use the matching similarity method for optimal results?
On top of this, I've also read this in a blog:
> Also, models tuned for cosine-similarity will prefer the retrieval of short documents, while models tuned for dot-product will prefer the retrieval of longer documents.
Is this something that holds true across all or most models?