When doing a kNN search, there is a parameter k that specifies the (approximate) number of best-matching documents to return. You can also use the size parameter to determine the number of documents to return. Is there ever any reason why k and size would not be set to the same value?

And what about the from parameter? It can be used with kNN searches, but I suppose it works like this: with a from value of 10, Elasticsearch computes the k+10 best-matching documents and then returns all of them except the first 10. Is that correct? If so, does a higher from value make the computation heavier?

My questions are related to this query that I am using:

As you mentioned, kNN search adds k matching documents to the search. So if you set k=20, it finds 20 matches. The size parameter then takes the top-scoring documents from these; it only controls how many hits are returned in the response. In this case, considering your query, it will return the 20 matches.

It's a little confusing when you're doing only kNN search. But it fits with how the search API is designed.

The from parameter follows the same idea, but it defines the number of hits to skip, which is useful for paginating search results.
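To make the three knobs concrete, here is a minimal sketch of a search request body built in Python. It is not the query from the question; the field name "embedding", the vector values, and the helper build_knn_request are all made up for illustration.

```python
import json

def build_knn_request(query_vector, k, num_candidates, size, from_=0):
    """Build a search request body with a top-level knn clause."""
    return {
        "knn": {
            "field": "embedding",  # hypothetical dense_vector field name
            "query_vector": query_vector,
            "k": k,                          # nearest neighbors to find
            "num_candidates": num_candidates,  # candidates examined per shard
        },
        "from": from_,  # hits to skip in the final result list
        "size": size,   # hits to return in the response
    }

# k=20 neighbors are found; the response skips 10 hits and returns the next 10.
body = build_knn_request([0.1, 0.2, 0.3], k=20, num_candidates=100, size=10, from_=10)
print(json.dumps(body, indent=2))
```

Note that k and num_candidates live inside the knn clause, while from and size sit at the top level of the request.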

Now, to gather results, kNN search finds num_candidates approximate nearest neighbor candidates on each shard. Each shard keeps the k candidates most similar to the query vector, and Elasticsearch then merges the per-shard results into the global top k. So if you want faster searches you can decrease num_candidates, at the cost of potentially less accurate results.
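The gather-and-merge step can be pictured with a toy model. This is only a sketch with made-up scores and document IDs, and it assumes each shard reports its best k candidates before the merge; it says nothing about how Elasticsearch implements this internally.

```python
import heapq

def shard_top_k(candidates, k):
    """Return the k highest-scoring (score, doc_id) pairs among one shard's candidates."""
    return heapq.nlargest(k, candidates)

def merge_shards(per_shard_hits, k):
    """Merge per-shard results into the global top k by score."""
    merged = [hit for hits in per_shard_hits for hit in hits]
    return heapq.nlargest(k, merged)

# Made-up (score, doc_id) candidates, as if num_candidates=4 on each shard.
shard_a = [(0.91, "a1"), (0.40, "a2"), (0.75, "a3"), (0.10, "a4")]
shard_b = [(0.88, "b1"), (0.95, "b2"), (0.30, "b3"), (0.60, "b4")]

k = 3
top = merge_shards([shard_top_k(shard_a, k), shard_top_k(shard_b, k)], k)
print(top)
```

A smaller num_candidates means fewer candidates per shard to score, which is where the speed gain comes from, but a true nearest neighbor is more likely to be missing from the candidate set.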

As of version 8.11 and earlier, we only look directly at k and num_candidates. I am unsure if this behavior will ever change.

If k is larger than size, we still search for k hits, but you will only retrieve size of them. In the scenario where from > 0 and k > from + size, the results will indeed skip the first from hits and return only the next size.

But we do NOT adjust k or num_candidates based on those parameters.
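For a kNN-only search, the effect of from and size can be sketched as a plain slice over the k hits that were already found. The paginate helper and the document names are hypothetical; this is a toy model of the behavior described here, not Elasticsearch code.

```python
def paginate(knn_hits, from_, size):
    """Slice the already-computed hit list; k is never enlarged to compensate."""
    return knn_hits[from_:from_ + size]

k = 20
# Stand-in for the k hits a kNN-only search produced, best match first.
knn_hits = [f"doc{i}" for i in range(1, k + 1)]

page = paginate(knn_hits, from_=10, size=5)     # k > from + size: a full page
partial = paginate(knn_hits, from_=15, size=10)  # from + size > k: only 5 hits left
empty = paginate(knn_hits, from_=20, size=10)    # from >= k: nothing left to return

print(page, len(partial), empty)
```

So if you want to page deep into kNN results, k itself has to be at least from + size; otherwise later pages come back short or empty.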

I see. When combining a regular query with a kNN search, the combined results can number more than k, so size then determines how many results are actually returned. In that case it may make sense to pick different values for the two.

And my conclusion about the computational cost: from doesn't affect it. The query/knn clauses are executed as is, and the results are sliced by size/from at the very end.
