I am trying to build a semantic text search using Elasticsearch. For this I have decided to use the cosine similarity between the embedding of the query paragraph and the embeddings of the search paragraphs.
The problem is that the dense_vector datatype has a maximum size of 2048 dimensions, but my embeddings have 8000 dimensions. How can I store a vector of size 8000 and compute cosine similarity on it using Elasticsearch?
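For reference, this is roughly the setup I had in mind, just as a sketch (the index and field names are placeholders, and the query vector is truncated for brevity); the mapping is rejected because dims is above the limit:

```
PUT paragraphs
{
  "mappings": {
    "properties": {
      "text":      { "type": "text" },
      "embedding": { "type": "dense_vector", "dims": 8000 }  // rejected: dims above the 2048 limit
    }
  }
}

GET paragraphs/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": { "query_vector": [0.12, 0.0, 0.3] }  // would be the full 8000-dim query embedding
      }
    }
  }
}
```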
Any help or suggestion would be highly appreciated.
Hi @mayya,
Thank you for your response.
I am using TF-IDF to generate the embedding vectors. I experimented with a lot of models, but TF-IDF gave the best results in my case, maybe because the task is very domain-specific.
Can we not save the embeddings in the array datatype and use a script_score (cosine) on that? If it is possible, does it affect the response time compared to dense vectors?
I am using TF-IDF to generate the embedding vectors.
Is there a reason why you don't use Elasticsearch itself for this? It was specifically designed to compute term frequencies and inverse document frequencies and to calculate scores based on them. If the default BM25 similarity doesn't suit you, you can design your own scripted similarity.
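As a minimal sketch of what a scripted similarity looks like (adapted from the similarity module documentation; the index and field names are placeholders, and the Painless script is a basic TF-IDF-style score that you would adjust to your own formula):

```
PUT my-index
{
  "settings": {
    "number_of_shards": 1,
    "similarity": {
      "scripted_tfidf": {
        "type": "scripted",
        "script": {
          "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount + 1.0) / (term.docFreq + 1.0)) + 1.0; double norm = 1 / Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "similarity": "scripted_tfidf"
      }
    }
  }
}
```

Any text field mapped with this similarity is then scored by your script instead of BM25.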
Can we not save the embeddings in the array datatype and use a script_score (cosine) on that?
This is not going to work, as arrays internally store their data in sorted order; they don't preserve the order of elements the way dense vectors do. For example, an array [0, -10, 20] will be stored as [-10, 0, 20], and when you access doc['my_array'][0] you will get -10 instead of the desired 0.
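You can see this for yourself with a small sketch (index and field names are placeholders): index the array and read the first element back through doc values with a script field.

```
PUT test/_doc/1?refresh
{ "my_array": [0, -10, 20] }

GET test/_search
{
  "script_fields": {
    "first_element": {
      "script": { "source": "doc['my_array'][0]" }
    }
  }
}
```

The script field comes back as -10 rather than 0, because doc values do not preserve the original ordering.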
Thank you for the clarification. The only reason I wanted to store my custom TF-IDF vectors was that they were giving slightly better results than the default TF-IDF-based search in Elasticsearch, but I will look into the scripted similarity approach you suggested.
Thanks again!