The num_candidates parameter leads to some confusing query results

Hi team!
I have an index with only 1 primary shard. I insert documents and then run knn searches. When I do a knn query with k=10 and num_candidates=20, I get one batch of results. When I use k=10 and num_candidates=25, I get some documents with higher scores, which confuses me. num_candidates is supposed to be the top n closest documents per shard, so why don't the higher-scoring documents from num_candidates=25 appear in the results for num_candidates=20?

Elasticsearch version : 8.7.0

result:

Hey @EricTowns , num_candidates is the number of vectors searched per shard. It is the same as the efSearch parameter in the original HNSW paper: [1603.09320] Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs

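Meaning, it helps control the approximate nature of the search. With a higher num_candidates, more time is spent exploring the graph, but a more accurate set of neighbors is returned. A lower num_candidates gives a faster search, but a less accurate one. That is why a closer document found with num_candidates=25 can be missing at num_candidates=20: the smaller search may never have reached it in the graph.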
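To make the two queries from the question concrete, here is a minimal Python sketch that only builds the search bodies and does not contact a cluster. The field name `embedding`, the sample vector, and the helper `build_knn_search` are assumptions for illustration. The checks reflect the documented constraints as I understand them (num_candidates must be at least k and at most 10,000), so verify them against your version.

```python
import json

# Upper bound for num_candidates as documented for 8.x (assumption: check your version).
MAX_NUM_CANDIDATES = 10_000


def build_knn_search(field, query_vector, k, num_candidates):
    """Build the body of a _search request that uses the top-level knn option.

    `field` is a hypothetical dense_vector field name.
    """
    if not 1 <= k <= num_candidates:
        raise ValueError("k must be >= 1 and no larger than num_candidates")
    if num_candidates > MAX_NUM_CANDIDATES:
        raise ValueError("num_candidates must not exceed 10000")
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


# The two queries from the question differ only in num_candidates.
query_vector = [0.12, -0.45, 0.33]  # made-up example vector
narrow = build_knn_search("embedding", query_vector, k=10, num_candidates=20)
wide = build_knn_search("embedding", query_vector, k=10, num_candidates=25)

print(json.dumps(narrow, indent=2))
print(json.dumps(wide, indent=2))
```

Both bodies ask for the same k, so any difference in the returned hits comes from how much of the graph each search explored.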

Thank you for your answer. I have another question. Sometimes, the first time I use knn, the search does not work, but I am sure the doc has been stored successfully. When I write the doc again, it can be searched, and the search vector is the same. Any advice is welcome!

Sometimes, the first time using knn can not search, but I am sure the doc has been stored successfully

Do you mean the vector isn't in the result set? Or that the search throws an exception and returns an error?

Elasticsearch has a refresh API: Refresh API | Elasticsearch Guide [8.13] | Elastic

If you want documents to be available immediately, there are various ways to make sure they are searchable right away. You can wait for a refresh on search or on index, force one at index time, or call this API manually. All have various costs and considerations.

But I think your issue is simply that a refresh hasn't occurred between the index & the search.
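As a rough illustration of two of those options, the Python sketch below only assembles the HTTP method and path for each request and sends nothing. The index name `my-index`, the document ID, and both helper functions are hypothetical. The `refresh` query parameter on the index API and the `_refresh` endpoint are the documented mechanisms as far as I know.

```python
from urllib.parse import quote, urlencode

# Values the index API accepts for ?refresh (to my knowledge).
ALLOWED_REFRESH = (None, "true", "false", "wait_for")


def index_doc_request(index, doc_id, refresh=None):
    """Return (method, path) for indexing one document by ID.

    refresh="wait_for" makes the call return only after a refresh has made
    the document searchable; refresh="true" forces a refresh right away.
    """
    if refresh not in ALLOWED_REFRESH:
        raise ValueError(f"unsupported refresh value: {refresh!r}")
    path = f"/{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    if refresh is not None:
        path += "?" + urlencode({"refresh": refresh})
    return "PUT", path


def manual_refresh_request(index):
    """Return (method, path) for an explicit refresh of one index."""
    return "POST", f"/{quote(index, safe='')}/_refresh"


print(index_doc_request("my-index", "1", refresh="wait_for"))
print(manual_refresh_request("my-index"))
```

Forcing a refresh on every write is usually the most expensive choice, so `wait_for` or a single manual refresh after a batch tends to be the gentler option.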

What I see is this:
For example, I vectorize the text "i am a doctor" and store it. Then, when searching:

  1. I vectorize the question "doctor" and perform a knn search. At this time, nothing was found.
  2. I vectorize the original text "i am a doctor" and perform a knn search, and it works.
  3. Then I rewrite this vector and repeat the first search, which is "doctor", and this time I get the result.

k and num_candidates are the same in all of these searches, which is confusing to me. Is this also due to the concept of "nearest neighbor"? At first, I thought that the _score of the search results could be used as the degree of "neighbor", but now it seems that it cannot?

Looking forward to your answer. This has bothered me for a long time.

  1. I vectorize the question "doctor" and perform a knn search. At this time, no search was found,

Does this mean no data at all? That doesn't make sense as you should always get k neighbors back.

Or are you talking about a particular neighbor that you are interested in?

Are you forcing a refresh before searching in step 1?

Is this also due to the concept of "nearest neighbor"?

No. _score is determined by vector similarity.
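To show what "determined by vector similarity" means in practice, here is a small Python sketch of how a similarity value is mapped to _score for two of the dense_vector similarity options, as I recall them from the docs: cosine uses (1 + cosine) / 2 and l2_norm uses 1 / (1 + distance^2). The thread does not say which similarity the index uses, so treat the choice as an assumption. The function names are hypothetical.

```python
import math


def cosine_score(query, vector):
    """_score for similarity "cosine": (1 + cosine(query, vector)) / 2."""
    dot = sum(q * v for q, v in zip(query, vector))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_v = math.sqrt(sum(v * v for v in vector))
    return (1.0 + dot / (norm_q * norm_v)) / 2.0


def l2_norm_score(query, vector):
    """_score for similarity "l2_norm": 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


# Made-up 2-dimensional vectors, purely for illustration.
print(cosine_score([1.0, 0.0], [1.0, 0.0]))   # identical direction
print(cosine_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(l2_norm_score([0.0, 0.0], [3.0, 4.0]))  # distance 5
```

Under either formula the score depends only on the query vector and the stored vector, so the same pair always yields the same _score; k and num_candidates only affect which documents the search gets to compare.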

Oh, that was a mistake in my description. What I meant by not being able to find "I'm a doctor" is this: there were a bunch of vectors, and I searched with a query vector and a fixed value of k. The first time I searched, the document was not in the top k results. After I rewrote the vector, it was in the top k results, with a higher score and a higher rank.

Hello @EricTowns, is the error still happening? If so, could you send your query here and share a screenshot of the two behaviors?