How does vector-based text similarity work under the hood?

This blog says:

we’ve provided a match_all query, which means the script will be run over all documents in the index. This is a current limitation of vector similarity in Elasticsearch — vectors can be used for scoring documents, but not in the initial retrieval step. Support for retrieval based on vector similarity is an important area of ongoing work.

Does this mean that all documents are scanned linearly to calculate the cosine similarity (or another metric), and that the top k results are then returned?

If yes: how does Elasticsearch do it so quickly? For 50,000 documents it takes about 50 ms, which is insanely fast.

If not: how does Elasticsearch implement this?

For every document that matches a given query, the cosine similarity is calculated in a linear fashion.
If you use a match_all query, this calculation is done for all documents in the index. You can choose a more restrictive query to reduce the number of documents that are scored.
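To make the linear scan concrete, here is a minimal sketch in plain Python. It is not Elasticsearch's code; the function names, the `matches` filter (standing in for the query), and the sample documents are all made up for illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query_vector, docs, k, matches=lambda doc: True):
    """Score every doc accepted by `matches` (hypothetical stand-in for the
    query; the default behaves like match_all) and keep the k best."""
    scored = (
        (cosine_similarity(query_vector, doc["vector"]), doc["id"])
        for doc in docs
        if matches(doc)
    )
    return heapq.nlargest(k, scored)

# Hypothetical documents, for illustration only.
docs = [
    {"id": "a", "lang": "en", "vector": [1.0, 0.0]},
    {"id": "b", "lang": "en", "vector": [0.6, 0.8]},
    {"id": "c", "lang": "de", "vector": [0.0, 1.0]},
]

top_all = brute_force_top_k([1.0, 0.0], docs, k=2)
top_de = brute_force_top_k([1.0, 0.0], docs, k=2,
                           matches=lambda doc: doc["lang"] == "de")
```

With the default filter all three documents are scored; with the restrictive filter only one is, which is the whole saving a narrower query buys you.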

If you have several shards, these calculations run in parallel, one per shard. Other than that, there is no parallelism.
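The per-shard pattern can be sketched roughly as follows: each shard does its own linear scan and returns a local top k, and the partial results are merged. This is only an illustration of the structure, assuming unit-length vectors so that a dot product equals the cosine; the names are hypothetical, and Python threads here show the shape of the computation rather than real CPU parallelism.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def shard_top_k(query_vector, shard_docs, k):
    """Linear scan of one shard. Vectors are assumed to be unit length,
    so the dot product is the cosine similarity."""
    scored = (
        (sum(q * v for q, v in zip(query_vector, vector)), doc_id)
        for doc_id, vector in shard_docs
    )
    return heapq.nlargest(k, scored)

def search(query_vector, shards, k):
    """Scan every shard concurrently, then merge the local top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: shard_top_k(query_vector, shard, k), shards)
        return heapq.nlargest(k, (hit for partial in partials for hit in partial))

# Two hypothetical shards, each a list of (doc_id, unit_vector) pairs.
shards = [
    [("a", [1.0, 0.0]), ("b", [0.0, 1.0])],
    [("c", [0.6, 0.8]), ("d", [0.8, 0.6])],
]
results = search([1.0, 0.0], shards, k=2)
```

With a single shard there is nothing to merge, and the whole scan runs in one thread.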

We do a few optimizations, but not many: e.g. the vector length (magnitude) is calculated and stored during indexing, so during search it is just retrieved; we use an NIO ByteBuffer to decode vectors, etc.
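As a rough illustration of the stored-magnitude idea, the sketch below packs a vector together with its precomputed magnitude at "index time" and reads both back at "search time", so the only square root per query is the one for the query vector. The byte layout (big-endian 32-bit floats with the magnitude appended) and the function names are assumptions made for this example, not a description of the actual on-disk format.

```python
import math
import struct

def encode_vector(vector):
    """Index time: pack the floats plus the precomputed magnitude.
    Layout is hypothetical: big-endian float32 values, magnitude last."""
    magnitude = math.sqrt(sum(x * x for x in vector))
    return struct.pack(f">{len(vector) + 1}f", *vector, magnitude)

def cosine_from_encoded(query_vector, query_magnitude, encoded):
    """Search time: decode the bytes and reuse the stored magnitude
    instead of recomputing it for every document."""
    count = len(encoded) // 4  # 4 bytes per float32
    *vector, magnitude = struct.unpack(f">{count}f", encoded)
    dot = sum(q * v for q, v in zip(query_vector, vector))
    return dot / (query_magnitude * magnitude)

encoded = encode_vector([3.0, 4.0])

query = [3.0, 0.0]
query_magnitude = math.sqrt(sum(x * x for x in query))  # computed once per query
score = cosine_from_encoded(query, query_magnitude, encoded)
```

Per document this leaves one dot product and one division, which is why the scan stays cheap even though it is linear.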

Hi Mayya,
I hope you and your family are safe, and thank you so much for your quick reply.

I have a single shard with about 50k documents, and searching takes less than 50 ms, so it's a little hard to digest that cosine similarity is calculated in a linear fashion for every query. I mean, calculating the cosine similarity between 50k docs and the input query and returning the top 10 in 50 ms seems magical. How is this so fast?

I also tried Amazon's KNN implementation, but I did not see any significant improvement in response time.

I am new to the ELK stack. I read the docs and watched a bunch of videos but couldn't find the answer. Sorry to bother you.

Thank you for the warm wishes.

The speed of Elasticsearch cosine similarity depends on two factors:

  • number of docs
  • number of dimensions in each doc

Since documents are scanned in a linear fashion, the time a cosine similarity query takes grows linearly with the number of documents. 50K is a pretty small collection. Here we have benchmarks on some bigger doc collections (under the section "Bruteforce benchmarks").
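A back-of-the-envelope count shows why 50K documents is a small workload. The sketch below just multiplies the two factors; the 512-dimension figure is an assumption for illustration, since the thread does not state the vector size.

```python
def multiply_adds(num_docs, num_dims):
    """Each document costs roughly one multiply-add per dimension."""
    return num_docs * num_dims

def required_rate(num_docs, num_dims, latency_seconds):
    """Multiply-adds per second needed to finish within the given latency."""
    return multiply_adds(num_docs, num_dims) / latency_seconds

NUM_DOCS = 50_000   # from the question
NUM_DIMS = 512      # assumed; not stated in the thread
LATENCY = 0.05      # 50 ms, from the question

work = multiply_adds(NUM_DOCS, NUM_DIMS)
rate = required_rate(NUM_DOCS, NUM_DIMS, LATENCY)
doubled = multiply_adds(2 * NUM_DOCS, NUM_DIMS)
```

Under that assumption the scan is about 25.6 million multiply-adds, or roughly half a billion per second to fit in 50 ms, which is a plausible rate for one CPU core. Doubling the number of documents doubles the work, so the same scan over millions of documents is where the latency starts to hurt.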
