Store dense vectors with more than 2048 dimensions

I am trying to build semantic text search using Elasticsearch. For this I have decided to use the cosine similarity between the embedding of the query paragraph and the embeddings of the search paragraphs.
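The ranking step described above — scoring each stored paragraph's embedding against the query embedding by cosine similarity — can be sketched in plain Python (illustrative only; the paragraph names and tiny vectors are made up):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical query and paragraph embeddings (real ones would have ~8000 dims).
query = [1.0, 0.0, 1.0, 0.0]
paragraphs = {
    "para_1": [0.9, 0.1, 0.8, 0.0],
    "para_2": [0.0, 1.0, 0.0, 1.0],
}

# Rank paragraphs by similarity to the query, best match first.
ranked = sorted(paragraphs,
                key=lambda p: cosine_similarity(query, paragraphs[p]),
                reverse=True)
print(ranked)  # ['para_1', 'para_2']
```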

The problem is that the dense_vector datatype has a maximum size of 2048 dimensions, but my embeddings have 8000. How can I store a vector of size 8000 and calculate cosine similarity on it using Elasticsearch?

Any help or suggestion would be highly appreciated.

There is currently no way to store vectors of 8K dims.

We haven't yet encountered models that need that many dimensions.
Would you mind sharing which models produce that many dimensions?


Hi @mayya,
Thank you for your response.
I am using TF-IDF to generate the embedding vectors. I experimented with a lot of models, but TF-IDF gave the best results in my case, maybe because the task is very domain-specific.
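For context, TF-IDF vectors have one dimension per vocabulary term, which is why they easily exceed 2048 dimensions. A minimal sketch of such a vectorizer (an assumption about the pipeline, using the classic tf × idf weighting with smoothing — not the poster's actual code):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # One dimension per vocabulary term: dimensionality = vocabulary size.
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # tf * (log(N/df) + 1); the +1 keeps ubiquitous terms from zeroing out.
        vectors.append([tf[w] * (math.log(n / df[w]) + 1.0) for w in vocab])
    return vocab, vectors

docs = ["search engines rank documents", "semantic search with embeddings"]
vocab, vecs = tfidf_vectors(docs)
# Every vector has exactly len(vocab) dimensions.
print(len(vocab), len(vecs[0]))
```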

Can we not save the embeddings into the array datatype and use script score (cosine) on that. Is it possible, if yes does it affect the response time compared to dense vectors?

I am using TF-IDF to generate the embeddings vector.

Is there a reason why you don't use Elasticsearch itself for this? It was specifically designed to compute term frequencies and inverse document frequencies and calculate scores based on them. If the default BM25 similarity doesn't suit you, you can define your own scripted similarity.
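For reference, a scripted similarity is configured in the index settings and replaces BM25 for the fields that point at it. A hedged sketch of such a settings body, loosely following the TF-IDF-style example in the Elasticsearch similarity documentation (the names "my_tfidf" and "paragraph" are made up):

```python
import json

# Index settings defining a custom scripted similarity (a Painless script).
# The script variables doc.freq, term.docFreq, field.docCount and doc.length
# are the statistics the scripted similarity exposes.
settings = {
    "settings": {
        "index": {
            "similarity": {
                "my_tfidf": {
                    "type": "scripted",
                    "script": {
                        "source": (
                            "double tf = Math.sqrt(doc.freq); "
                            "double idf = Math.log((field.docCount + 1.0) / "
                            "(term.docFreq + 1.0)) + 1.0; "
                            "double norm = 1 / Math.sqrt(doc.length); "
                            "return query.boost * tf * idf * norm;"
                        )
                    },
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Point the text field at the custom similarity instead of BM25.
            "paragraph": {"type": "text", "similarity": "my_tfidf"}
        }
    },
}

# This body would be sent when creating the index (PUT <index-name>):
print(json.dumps(settings, indent=2))
```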

Can we not save the embeddings into the array datatype and use script score (cosine) on that.

This is not going to work, as arrays internally store their data in sorted order. They don't keep the order of elements the way dense vectors do. For example, an array [0, -10, 20] will be stored as [-10, 0, 20], and when you access doc['my_array'][0] you will get -10 instead of the desired 0.
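A quick way to see why sorted storage breaks cosine scoring: sorting changes which value sits in which dimension, so a vector is no longer identical to its own stored form (a plain-Python illustration of the array example above):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

original = [0, -10, 20]
stored = sorted(original)       # mimics the sorted storage: [-10, 0, 20]
print(stored[0])                # -10, not the 0 that was in position 0
print(cosine_similarity(original, stored))  # ~0.8, not 1.0 -- order matters
```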


Thank you for the clarification. The only reason I wanted to store my custom TF-IDF vectors was that they gave slightly better results than the default TF-IDF-based search in Elasticsearch, but I will look into the scripted similarity you suggested.
Thanks again!
