Why "knn_query" doesn’t have a separate k parameter?

alliswell · May 8, 2024, 12:31pm

Thanks for the explaining how merging works differently for knn-query and top-level-knn.

Coming back to

Will it still be useful to have a separate k parameter, if it is applicable only per shard base?

short response;

I think we should still keep support for separate k parameter. (even when at shard level). We should keep it for the very same reason why we have separate k parameter in top-level-knn. Because it not only helps during selecting top-k at coordinator node. But it is also use to select top-k within each shard.

not so short response;

For a moment lets keep Elasticsearch and sharding out of picture and lets see how search for most nearest-neighbours for given query_vector is performed in HNSW graph.

The HNSW graph consists of multiple layers (refer attached figure).

Step 1: The search begins by selecting a random entry point (current_vertex) at the highest layer.

Step 2: Calculate the distance between the query_vector and the current_vertex, as well as all neighbors of the current_vertex.

Case 1: If the closest vertex is a neighbor of the current_vertex, greedily select that vertex and make it the current_vertex. If the current layer is not lowest layer goto step-2, othwerwise goto step-3
Case 2: If the closest vertex is the current_vertex and it’s not the lowest layer, move down one level and goto Step 2. If the current layer is the lowest layer, proceed to Step 3.

Step 3: Ultimately, the current_vertex at the lowest layer is identified as the nearest neighbor for the given query_vector.

reason

During the whole search process, after randomly selecting entry point we employed “greedy” strategy to select next current_vertex. And at each step it is using only local information to make decision. Due to this the selected vertex might not be the closet neighbour to query_vector. To overcome this HNSW suggests following:

Instead of finding only one nearest neighbour on each layer, the efSearch (i.e num_candidates) closest nearest neighbours to the query_vector are found and each of these neighbours is used as the entry point on the next layer.

reference => link

Returning to the context of Elasticsearch, each shard has its own HSNW graph. Suppose the goal is to find the 10 nearest neighbors from each shard. We should use more than 10 entry points (i.e., num_candidates/efSearch should be > 10) to ensure more precise vector matches. To enable this, we should add support for k (which would be set to 10 in this case).

Topic		Replies	Views
"The heap is empty" error on KNN search Elasticsearch vector-search	11	745	January 23, 2023
KNN search returns an empty result set when num_candidates is less than the filtered doc count Elasticsearch vector-search	10	547	October 18, 2024
Elastic search sub_searches with knn and rrf Elastic Search	4	472	May 22, 2024
Aggregate Score for Hybrid Search Elasticsearch vector-search	22	2438	March 17, 2023
Elasticsearch Aproximate KNN search using custom score calculation using a custom priority field Elastic Search painless , elastic-site-search	13	1054	April 18, 2024

Why "knn_query" doesn’t have a separate k parameter?

short response;

not so short response;

reason

Related topics