How does vector-based text similarity work under the hood?

yash_budukh · June 17, 2020, 3:17pm

we’ve provided a match_all query, which means the script will be run over all documents in the index. This is a current limitation of vector similarity in Elasticsearch — vectors can be used for scoring documents, but not in the initial retrieval step. Support for retrieval based on vector similarity is an important area of ongoing work.

Does this mean that all the documents are scanned linearly to calculate the cosine similarity (or another metric), and the top k results are then returned?

If yes, how does Elasticsearch do it so quickly? For 50,000 documents it takes about 50 ms, which is insanely fast.

If not, how does Elasticsearch implement this?

mayya · June 17, 2020, 6:23pm

For all documents that match a given query, a cosine similarity will be calculated in a linear fashion.
If you are using match_all query, that means for all documents in the index this calculation will be done. You can choose a more restrictive query.
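As a rough illustration of the difference, here is a minimal Python sketch that builds a `script_score` request body twice: once over `match_all` and once over a more restrictive inner query. The field names `my_vector` and `category`, the helper `build_script_score_query`, and the sample vector are hypothetical. The script source follows the form documented for recent 7.x releases; the exact `cosineSimilarity` signature differs between versions, so check the docs for your release.

```python
import json


def build_script_score_query(query_vector, inner_query=None,
                             vector_field="my_vector", size=10):
    """Build a script_score request body (hypothetical helper).

    Only documents matched by `inner_query` are scored, so a narrower
    inner query means fewer cosine similarity computations.
    """
    if inner_query is None:
        # match_all: every document in the index gets scored.
        inner_query = {"match_all": {}}
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": inner_query,
                "script": {
                    # "+ 1.0" keeps the score non-negative.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


# Scores every document in the index.
full_scan = build_script_score_query([0.1, 0.2, 0.3])

# Scores only documents whose (hypothetical) `category` field is "news".
restricted = build_script_score_query(
    [0.1, 0.2, 0.3], inner_query={"term": {"category": "news"}}
)

print(json.dumps(restricted, indent=2))
```

The resulting dictionaries can be sent as the body of a `_search` request with whichever client you use.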

If you have several shards, these calculations will run in parallel across shards. Other than that, there is no parallelism.

We try to do some optimizations, but not many: for example, the vector length (magnitude) is calculated and stored during indexing, so during search it is just retrieved, and a NIO ByteBuffer is used to decode vectors.
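To show why storing the vector length at index time helps, here is a small Python sketch of the idea. It is not the Elasticsearch implementation: the functions `index_vector` and `cosine_with_stored_magnitude` and the sample vectors are made up for illustration, and the sketch assumes no zero vectors.

```python
import math


def index_vector(vec):
    """Index-time step: compute the magnitude once and store it with the vector."""
    magnitude = math.sqrt(sum(x * x for x in vec))
    return {"vector": list(vec), "magnitude": magnitude}


def cosine_with_stored_magnitude(query_vec, query_magnitude, doc):
    """Search-time step: only the dot product is computed per document.

    The document magnitude is read back, not recomputed.
    Assumes neither magnitude is zero.
    """
    dot = sum(q * d for q, d in zip(query_vec, doc["vector"]))
    return dot / (query_magnitude * doc["magnitude"])


docs = [index_vector(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0])]

query = [1.0, 0.0]
# The query magnitude is computed once per search, not once per document.
query_magnitude = math.sqrt(sum(x * x for x in query))

scores = [cosine_with_stored_magnitude(query, query_magnitude, d) for d in docs]
print(scores)
```

Per document, this saves one pass over the vector and a square root, which is a constant-factor gain; the scan itself stays linear.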

yash_budukh · June 17, 2020, 6:38pm

Hi Mayya,
I hope you and your family are safe, and thank you so much for your quick reply.

I have a single shard with about 50k documents, and searching takes less than 50 ms, so it's a little hard to digest that cosine similarity is calculated in a linear fashion for every query. Calculating cosine similarity between 50k docs and the input query and returning the top 10 in 50 ms seems magical. How is this so fast?

I also tried Amazon's KNN implementation, but I did not see any significant improvement in response time.

I am new to the ELK stack. I read the docs and watched a bunch of videos but couldn't find the answer. Sorry to bother you.

mayya · June 17, 2020, 7:35pm

Thank you for the warm wishes.

The speed of Elasticsearch's cosine similarity depends on two factors:

number of docs
number of dimensions in each doc

Since documents are scanned in a linear fashion, the time a cosine similarity query takes grows linearly with the number of documents. 50K is a pretty small collection. Here we have benchmarks on some bigger doc collections (under the section "Bruteforce benchmarks").
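To make the linear scan concrete, here is a minimal pure-Python sketch of brute-force top-k retrieval. It is an illustration only, not Elasticsearch code: the collection size, the 16 dimensions, the random data, and the function names are all assumptions. The work is roughly (number of docs) × (number of dimensions) multiply-adds; for example, at an assumed 128 dimensions, 50,000 documents would need about 6.4 million multiply-adds for the dot products, which is a small workload for a modern CPU.

```python
import heapq
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def brute_force_top_k(query, docs, k=10):
    """Score every document once, keeping only the k best (score, doc_id) pairs."""
    scored = ((cosine(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scored)


random.seed(0)
dims = 16  # assumed dimensionality, chosen only to keep the demo quick
docs = {i: [random.uniform(-1, 1) for _ in range(dims)] for i in range(1000)}

query = docs[42]  # reuse a stored vector so the best match is known
top = brute_force_top_k(query, docs, k=10)

for score, doc_id in top:
    print(doc_id, round(score, 4))
```

Doubling either the number of documents or the number of dimensions roughly doubles the time of this loop, which matches the two factors listed above.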

