Storage issues in ES vectorized retrieval

When I ran a similarity search again, I hit the following problem. ES's cosine-similarity scoring returned one document as the top match. I then manually compared the question vector against both the vector of that top-ranked document and the vector of the document I expected to rank highest, and the vector I expected was in fact more similar to the question vector.

dense_vector_query = {
    "size": top_k,
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {
                    "query_vector": embeddings
                }
            }
        }
    }
}
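A side note on the query above: the `+ 1.0` is there because a `script_score` query in Elasticsearch must not produce negative scores, and raw cosine similarity ranges over [-1, 1]. A minimal local sketch (my own, not from the thread) mimicking the shifted score:

```python
import numpy as np

def shifted_cosine(v1, v2):
    # Mimic ES's cosineSimilarity(params.query_vector, 'embedding') + 1.0 locally.
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos + 1.0

a = np.array([1.0, 0.0])
assert abs(shifted_cosine(a, a) - 2.0) < 1e-9   # identical direction -> max score 2.0
assert abs(shifted_cosine(a, -a) - 0.0) < 1e-9  # opposite direction -> min score 0.0
```

This means an ES score of 1.8 corresponds to a raw cosine similarity of 0.8 when comparing manually.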

Hi @chao_xi , welcome to our community.

How many dimensions are you defining? It's not mandatory, but you could ensure your vector embeddings are normalized; if they are not, you can normalize them before indexing.

Example:

import numpy as np

def normalize(vector):
    # Scale to unit length; leave zero vectors unchanged to avoid division by zero.
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector
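For example (a quick sketch of my own), applying `normalize` to a raw vector yields a unit-length vector, so that after indexing, scores depend only on direction:

```python
import numpy as np

def normalize(vector):
    norm = np.linalg.norm(vector)
    return vector / norm if norm > 0 else vector

v = np.array([3.0, 4.0])   # length 5.0 before normalization
u = normalize(v)
# u now has (approximately) unit length: np.linalg.norm(u) ~ 1.0
```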

Thank you for your reply! For the vectors, I used the BGE embedding model, which converts text into a 1024-dimensional vector, and I used that vector directly for retrieval. I don't quite understand what you mean by normalization.

Here is the code I use to compare similarity:

import numpy as np

def calculate_similarity(vector1, vector2):
    # Plain cosine similarity between two vectors.
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    return dot_product / (magnitude1 * magnitude2)
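One sanity check worth noting (my own sketch, not from the thread): cosine similarity is invariant to scaling either vector, so normalization alone cannot change which document this manual comparison ranks highest. If the manual cosine comparison disagrees with the ES ranking, the discrepancy more likely lies in which stored vector is actually being compared (e.g. the indexed `embedding` differing from the one compared by hand):

```python
import numpy as np

def calculate_similarity(vector1, vector2):
    dot_product = np.dot(vector1, vector2)
    return dot_product / (np.linalg.norm(vector1) * np.linalg.norm(vector2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

# Scaling either vector leaves the cosine similarity unchanged.
assert abs(calculate_similarity(a, b) - calculate_similarity(10 * a, b)) < 1e-9
assert abs(calculate_similarity(a, b) - calculate_similarity(a, 0.5 * b)) < 1e-9
```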

Hi folks, I'm also encountering this issue, looking forward to a solution.