Storage issues in ES vectorized retrieval

When I ran a similarity search again, I hit the following problem. ES's cosine-similarity scoring returned one document as the top match. I then manually compared the question vector against both the vector of that top-ranked document and the vector of the document I expected to rank highest, and the vector I expected was in fact more similar to the question vector.

dense_vector_query = {
    "size": top_k,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {
                    "query_vector": embeddings
                }
            }
        }
    }
}
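A side note on the query above: the `+ 1.0` is there because a `script_score` query in Elasticsearch must not produce negative scores, and raw cosine similarity ranges over [-1, 1]. A minimal local sketch (my own, not from the thread) mimicking the shifted score:

```python
import numpy as np

def shifted_cosine(v1, v2):
    # Mimic ES's cosineSimilarity(params.query_vector, 'embedding') + 1.0 locally.
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos + 1.0

a = np.array([1.0, 0.0])
assert abs(shifted_cosine(a, a) - 2.0) < 1e-9   # identical direction -> max score 2.0
assert abs(shifted_cosine(a, -a) - 0.0) < 1e-9  # opposite direction -> min score 0.0
```

This means an ES score of 1.8 corresponds to a raw cosine similarity of 0.8 when comparing manually.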

Hi @chao_xi , welcome to our community.

How many dimensions are you defining? It's not mandatory, but you could ensure your vector embeddings are normalized; if they are not, you can normalize them before indexing.

Example:

import numpy as np

def normalize(vector):
    # Scale to unit length; leave zero vectors unchanged to avoid division by zero.
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector
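For example (a quick sketch of my own), applying `normalize` to a raw vector yields a unit-length vector, so that after indexing, scores depend only on direction:

```python
import numpy as np

def normalize(vector):
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector

v = np.array([3.0, 4.0])   # length 5.0 before normalization
u = normalize(v)
# u now has (approximately) unit length: np.linalg.norm(u) ~ 1.0
```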

Thank you for your reply! For the vectors, I used the BGE embedding model, which converts text into a 1024-dimensional vector, and I used that vector directly for retrieval. I don't quite understand what you mean by normalization.

Here is the code I use to compare similarity:

import numpy as np

def calculate_similarity(vector1, vector2):
    # Plain cosine similarity between two vectors.
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    return dot_product / (magnitude1 * magnitude2)
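One sanity check worth noting (my own sketch, not from the thread): cosine similarity is invariant to scaling either vector, so normalization alone cannot change which document this manual comparison ranks highest. If the manual cosine comparison disagrees with the ES ranking, the discrepancy more likely lies in which stored vector is actually being compared (e.g. the indexed `embedding` differing from the one compared by hand):

```python
import numpy as np

def calculate_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    return dot_product / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

# Scaling either vector leaves the cosine similarity unchanged.
assert abs(calculate_similarity(a, b) - calculate_similarity(10 * a, b)) < 1e-9
assert abs(calculate_similarity(a, b) - calculate_similarity(a, 0.5 * b)) < 1e-9
```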

Hi folks, I'm also encountering this issue, looking forward to a solution.