I am currently developing a semantic search solution and I have to work with more than 10k documents, which is more than the maximum number of candidates (10k) for the kNN algorithm. I am trying to find a way to search the whole index, not just 10k documents. I am also interested in how the 10k documents are selected for kNN.
So basically the question is: how can I execute a semantic search on an index that has more than 10k documents, and how are the 10k candidates selected?
When performing a kNN search, each index shard performs an approximate nearest neighbours search. Each of these searches considers at most num_candidates results as the top candidates. Thus, num_candidates does limit the total number of documents searched on each shard.
The knn query uses a data structure, HNSW (Hierarchical Navigable Small World graph), that makes it efficient to find the nearest neighbours to the vector representing the search query. It is built to avoid computing the similarity against every vector in the index while still retrieving the best possible candidates.
However, this is approximate: it offers a tradeoff between speed and precision that depends on num_candidates. Increasing num_candidates increases precision (as more candidates are considered) at the cost of search speed.
Once num_candidates results have been gathered on each shard, the top k are selected per shard. The query then returns the overall top k from all the shard results combined.
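As a concrete illustration, here is a minimal approximate kNN search request. The index name my-index, the field my_vector, and the 3-dimensional query vector are placeholders; a real dense_vector field will usually have many more dimensions:

```json
POST /my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.12, -0.45, 0.91],
    "k": 10,
    "num_candidates": 100
  }
}
```

Here each shard gathers up to 100 candidates and keeps its best 10; the coordinating node then merges the per-shard results into the final top 10.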
Keep in mind that you can also perform exact kNN search so you can compare precision, recall and speed of the two approaches.
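For reference, a minimal exact (brute-force) kNN sketch uses a script_score query over the same dense_vector field, with the same placeholder index, field, and vector as above:

```json
POST /my-index/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_vector') + 1.0",
        "params": { "query_vector": [0.12, -0.45, 0.91] }
      }
    }
  }
}
```

This computes the similarity against every document matched by the inner query, so it is precise but scales linearly with index size.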
Following your explanation, I fail to see the difference between the combination k: 1, num_candidates: 1 and k: 1, num_candidates: 10 if all documents are taken into account. Shouldn't both combinations return the same result? Could you provide a practical example?
My apologies - my response was incorrect in terms of the total documents searched. I have edited it to avoid confusing future readers.
In your example, num_candidates: 1 will effectively search for a single nearest neighbour on each shard, whereas num_candidates: 10 will track the 10 closest candidates found during the graph traversal. Both queries return a single top result (k: 1), but in the first case just one document per shard is ever considered, so the approximate search can easily return a vector that is not the true nearest neighbour; widening the candidate pool makes that miss far less likely.
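To make that concrete, here are the two requests side by side (same placeholder index, field, and query vector as above); only num_candidates differs:

```json
POST /my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.12, -0.45, 0.91],
    "k": 1,
    "num_candidates": 1
  }
}

POST /my-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.12, -0.45, 0.91],
    "k": 1,
    "num_candidates": 10
  }
}
```

Both return exactly one hit, but the second query explores more of the HNSW graph per shard, so its single hit is more likely to be the true nearest neighbour. Comparing either against the exact script_score query shown earlier is a simple way to measure recall.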
We believe that approximate kNN offers a good, tunable tradeoff between speed and precision. The HNSW data structure is specifically built for this use case.
However, you will need to experiment with num_candidates to find your own speed / precision balance. Also, remember that you can use exact kNN for comparison, or even as your kNN query method.