Hi team!
I have an index with a single primary shard. I insert documents and then run kNN searches. With k=10 and num_candidates=20, I get one set of results. With k=10 and num_candidates=25, some of the returned documents have higher scores, which confuses me. I thought num_candidates was the number of closest documents considered per shard, so why don't the higher-scoring documents from the num_candidates=25 search also appear in the results for num_candidates=20?
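The search is approximate, and num_candidates helps control how approximate it is. With a higher num_candidates, more time is spent exploring the graph, but a more accurate set of neighbors is returned. A lower num_candidates gives a faster but less accurate search. So a closer document can be missed with num_candidates=20 and still be found with num_candidates=25.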
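To make that trade-off concrete, here is a minimal toy sketch in Python. It is not Lucene's HNSW implementation: the graph construction, the `build_graph` and `approx_knn` helpers, and the random test data are all assumptions made for illustration. It only shows the general idea of a best-first graph search that keeps at most num_candidates results, so a small value can stop before reaching a closer vector.

```python
import heapq
import random


def dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_graph(vectors, m=3):
    """Toy neighbor graph (hypothetical, not how Lucene builds HNSW):
    each node links to its m nearest nodes, plus a chain i <-> i+1
    so that the whole graph is connected."""
    n = len(vectors)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: dist(vectors[i], vectors[j]))
        for j in others[:m]:
            graph[i].add(j)
            graph[j].add(i)
        if i + 1 < n:
            graph[i].add(i + 1)
            graph[i + 1].add(i)
    return graph


def approx_knn(vectors, graph, query, k, num_candidates, entry=0):
    """Best-first search that keeps at most num_candidates results."""
    d0 = dist(vectors[entry], query)
    visited = {entry}
    frontier = [(d0, entry)]   # min-heap: closest unexplored node first
    best = [(-d0, entry)]      # max-heap via negation: worst kept result on top
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= num_candidates and d > -best[0][0]:
            break  # the closest unexplored node is worse than every kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(vectors[nb], query)
            if len(best) < num_candidates or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > num_candidates:
                    heapq.heappop(best)  # drop the current worst result
    ranked = sorted((-neg_d, i) for neg_d, i in best)
    return [i for _, i in ranked[:k]]


def exact_knn(vectors, query, k):
    """Brute-force reference: compare the query against every vector."""
    ranked = sorted((dist(v, query), i) for i, v in enumerate(vectors))
    return [i for _, i in ranked[:k]]


# Made-up data, only to compare the two settings from the question.
random.seed(0)
vectors = [[random.random() for _ in range(4)] for _ in range(200)]
graph = build_graph(vectors)
query = [0.5, 0.5, 0.5, 0.5]
print("exact:            ", exact_knn(vectors, query, 10))
print("num_candidates=20:", approx_knn(vectors, graph, query, 10, 20))
print("num_candidates=25:", approx_knn(vectors, graph, query, 10, 25))
```

In this sketch, raising num_candidates up to the number of vectors makes the search visit the whole connected graph, so it matches the brute-force result. Smaller values may or may not match, depending on the data and the graph.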
Thank you for your answer. I have another question. Sometimes the first kNN search does not find a document, even though I am sure it was stored successfully. After I write the document again, it can be found, and the search vector is the same. Any advice is welcome!
If you want documents to be searchable immediately, there are various ways to ensure that. You can wait for a refresh on search or on index, force one at index time, or manually call the refresh API. All have various costs and considerations.
But I think your issue is simply that a refresh hasn't occurred between the index operation and the search.
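As a rough sketch of two of those options, the Python below builds the REST requests for indexing with the `refresh` query parameter and for calling the `_refresh` endpoint on an index. The base URL, the index name `my-index`, the document body, and the helper names are assumptions for illustration. Nothing is sent unless you call `send`, and a secured cluster would also need authentication.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured test cluster


def index_request(index, doc_id, document, refresh="wait_for"):
    """Build a PUT /<index>/_doc/<id>?refresh=... request.

    refresh="wait_for" makes the call return only after a refresh has made
    the document searchable; "true" forces a refresh right away."""
    if refresh not in ("true", "false", "wait_for"):
        raise ValueError("refresh must be 'true', 'false' or 'wait_for'")
    path = "/{}/_doc/{}".format(
        urllib.parse.quote(index, safe=""),
        urllib.parse.quote(str(doc_id), safe=""),
    )
    url = BASE_URL + path + "?" + urllib.parse.urlencode({"refresh": refresh})
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def refresh_request(index):
    """Build a POST /<index>/_refresh request (manual refresh)."""
    url = "{}/{}/_refresh".format(BASE_URL, urllib.parse.quote(index, safe=""))
    return urllib.request.Request(url, method="POST")


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Hypothetical document; the field names are made up for this example.
req = index_request("my-index", 1, {"text": "i am a doctor"})
print(req.get_method(), req.full_url)
```

Waiting for or forcing a refresh on every write has a cost, so for bulk loads it is usually cheaper to index everything and refresh once at the end.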
Here is what I see:
For example, I vectorize the text "i am a doctor" and store it. Then, when searching:
I vectorize the question "doctor" and perform a kNN search. At this point, nothing is found.
I vectorize the original text "i am a doctor" and perform a kNN search, and it works.
Then I rewrite this vector and repeat the first search, "doctor", and this time I get the document.
The k and num_candidates values are the same for all of these searches, which is confusing to me. Is this also due to the concept of "nearest neighbor"? At first, I thought the _score of the search results could be used as the degree of "neighbor", but now it seems that it cannot?
Looking forward to your answer. This has bothered me for a long time.
Oh, that was a mistake in my description. What I meant by not being able to find "i am a doctor" is this: there were a bunch of vectors, and I searched with one vector and a set value of k. The first time I searched, the document was not in the top k results. After I rewrote the vector, it was in the top k results, with a higher score and a higher rank.