I am trying to build a semantic text search using Elasticsearch. For this I have decided to use the cosine similarity between the embedding of the query paragraph and the embeddings of the search paragraphs.
The problem is that the dense_vector datatype has a maximum size of 2048 dimensions, but my embeddings have 8000 dimensions. How can I store a vector of size 8000 and compute cosine similarity on it using Elasticsearch?
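For reference, this is roughly the setup I had in mind, just as a sketch (the index and field names are placeholders, and the query vector is truncated for brevity); the mapping is rejected because dims is above the limit:

```
PUT paragraphs
{
  "mappings": {
    "properties": {
      "text":      { "type": "text" },
      "embedding": { "type": "dense_vector", "dims": 8000 }  // rejected: dims above the 2048 limit
    }
  }
}

GET paragraphs/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": { "query_vector": [0.12, 0.0, 0.3] }  // would be the full 8000-dim query embedding
      }
    }
  }
}
```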
Any help or suggestion would be highly appreciated.
Hi @mayya,
Thank you for your response.
I am using TF-IDF to generate the embedding vectors. I experimented with a lot of models, but TF-IDF gave the best results in my case, maybe because the task is very domain-specific.
Can we not save the embeddings in the array datatype and use a script_score (cosine) on that? If it is possible, does it affect the response time compared to dense vectors?
I am using TF-IDF to generate the embedding vectors.
Is there a reason why you don't use Elasticsearch itself for this? It was specifically designed to compute term frequencies and inverse document frequencies and to calculate scores based on them. If the default BM25 similarity doesn't suit you, you can design your own scripted similarity.
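As a minimal sketch of what a scripted similarity looks like (adapted from the similarity module documentation; the index and field names are placeholders, and the Painless script is a basic TF-IDF-style score that you would adjust to your own formula):

```
PUT my-index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; double norm = 1 / Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}
```

Any text field mapped with this similarity is then scored by your script instead of BM25.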
Can we not save the embeddings in the array datatype and use a script_score (cosine) on that?
This is not going to work, as arrays internally store their data in sorted order; they don't preserve the order of elements the way dense vectors do. For example, an array [0, -10, 20] will be stored as [-10, 0, 20], and when you access doc['my_array'][0] you will get -10 instead of the desired 0.
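You can see this for yourself with a small sketch (index and field names are placeholders): index the array and read the first element back through doc values with a script field.

```
PUT test/_doc/1?refresh
{ "my_array": [0, -10, 20] }

GET test/_search
{
  "script_fields": {
    "first_element": {
      "script": { "source": "doc['my_array'][0]" }
    }
  }
}
```

The script field comes back as -10 rather than 0, because doc values do not preserve the original ordering.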
Thank you for the clarification. The only reason I wanted to store my custom TF-IDF vectors was that they were giving slightly better results than the default TF-IDF-based search in Elasticsearch, but I will look into the scripted similarity approach you suggested.
Thanks again!