Semantic search on more than 10k documents

Denis_Stefan · September 19, 2023, 12:25pm

Hello.

I am currently developing a semantic search solution and I have to work with more than 10k documents (more than the maximum number of candidates which is 10k for the kNN algorithm). I am trying to find a solution to search in whole index, not just 10k documents. Also I am interested in the way in which the 10k documents are selected for kNN.

So basically the question is how can I execute a semantic search on an index which has more than 10k documents and how the 10k candidates are selected?

Thank you.

Carlos_D · September 19, 2023, 1:59pm

Hi @Denis_Stefan !

When performing kNN search, each index shard will perform an approximate neighbours search. Each of these searches will at most consider num_candidates as the top results. Thus, num_candidates does limit the total number of documents searched on each shard.

knn query uses a data structure (HSNW) for making it easy to find the nearest neighbours to the vector that represents the search query. This is built to avoid having to compute the similarity to all vectors and retrieve the best possible candidates.

However, this is approximate. It offers a tradeoff between speed and precision, as we depend on the num_candidates searched. Increasing the num_candidates will increase precision (as more candidates will be considered) at the cost of search speed.

Once num_candidates have been calculated on each shard, the top k will be selected on each shard. After that, the query will return the top k from all the shard results combined.

Keep in mind that you can also perform exact kNN search so you can compare precision, recall and speed of the two approaches.

Denis_Stefan · September 19, 2023, 3:12pm

Following your explanation, I fail to see the difference between the combination of k: 1 and num_candidates: 1 and k:1 and num_candidates:10 if all documents are taken into account. Shouldn't both combinations return same result? Could you provide a practical example?

Carlos_D · September 20, 2023, 7:44am

Hi @Denis_Stefan !

My apologies - my response was incorrect in terms of total documents searched. I have edited it to avoid confusion to future readers.

On your example, num_candidates: 1 will effectively search a single nearest neighbour on each shard, vs using num_candidates: 10 that will look for 10 . So both examples would return the top candidate, but in the first case just 1 document will be considered.

We believe that approximate knn offers a good, tunable tradeoff for speed and precision. The HNSW data structure is specifically built for this use case.

However, you will need to experiment with num_candidates to find your speed / precision balance. Also, remember that you can use exact knn for comparison or as your knn query method.

I hope this helps!

system · October 18, 2023, 7:45am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elastic KNN search questions Elasticsearch vector-search	3	1053	November 6, 2023
The num_candidates parameter leads to some confusing query results Elasticsearch vector-search	7	235	June 9, 2024
Running ANN with num_candidates > 10000 Elasticsearch vector-search	2	272	January 11, 2024
Managed Elastic search for billion scale dense vector index and performance Elasticsearch vector-search	3	1368	December 23, 2022
KNN search returns an empty result set when num_candidates is less than the filtered doc count Elasticsearch vector-search	10	212	October 18, 2024

Semantic search on more than 10k documents

Related topics