Thanks for clearing this up for me! This answers my original question, but I'm happy to learn more about KNN or answer any other questions you two have.
awesome!
Reading through the docs, use of max_size k and num_candidates does seem to be an anti-pattern. I haven't done a lot of vector math up to this point, so I will research the HNSW algorithm so I can better understand those minutiae.
If it helps at all, I'm happy to talk through this in a little more detail; I work on that part of the ES stack.
The app I'm working on runs this KNN query to build a pool of possible final documents, which is then further refined later in the app (I can't use a filtered KNN search because of the issues outlined in my other post here: Efficient Subquery Combinations).
I missed the original post you had created; apologies for that. I took a brief look and would be curious where you've gotten with that query. We might be able to iterate here a bit, and you can always request consulting services too. It sounds like you've tried a good bit already, though. I'm also not sure about the rrf suggestion, but if you reached out to consulting or support they might (I'm honestly not sure) be willing to give you some free cycles to play around with rrf.
For reference, pulled from your other post:
- A' = A
- B' = B - A
- C' = C - (A + B)
- D' = D - (A + B + C)
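In case it helps to see those pools spelled out, here is a minimal Python sketch of the same idea: each pool keeps only the document ids that no earlier pool already returned. The function name `disjoint_pools` and the sample ids are made up for illustration; this is just the set arithmetic, not anything Elasticsearch does for you.

```python
def disjoint_pools(*pools):
    """Return each pool minus every id already seen in an earlier pool.

    Mirrors A' = A, B' = B - A, C' = C - (A + B), and so on.
    Order within each pool is preserved.
    """
    seen = set()
    result = []
    for pool in pools:
        # dict.fromkeys drops duplicates inside a pool while keeping order
        fresh = [doc_id for doc_id in dict.fromkeys(pool) if doc_id not in seen]
        result.append(fresh)
        seen.update(pool)
    return result


# Hypothetical result ids from four separate queries A, B, C, D
a, b, c, d = disjoint_pools([1, 2], [2, 3], [1, 3, 4], [4, 5])
```

Doing this client-side means every query still fetches the overlapping documents; pushing the exclusion into the query itself is the alternative discussed below.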
My high-level thoughts (without having spent a ton of time thinking about the queries in your other post): for what it's worth, num_candidates determines how much of the HNSW graph gets explored, so a large value means you're spending a lot more time exploring it. At some point, if you bump the limit past 10k, those explorations will probably start to time out. If k and num_candidates are the same, what you get back is all of the closest 10k results in that HNSW graph, so k becomes mostly irrelevant. If you really need all of those results back, then that's probably the best you can do. If you can run multiple queries, it might ultimately be more efficient. You might have tried this already, but querying for A and then subsequently querying for B with a filter that eliminates all docs from A, but with smaller num_candidates lists, may yield more efficient results (I'm honestly not sure; I'd be curious to learn if that's the case and whether I'm understanding what you are trying to do here). The case 2 you mentioned in the other post also seems like it could be a subsequent metadata-only query. Fun problem space nonetheless, but it definitely seems like you'll have to play around here to get to an efficient query.
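To make the multi-query idea a bit more concrete, here is a rough Python sketch that only builds the search request bodies. The field name `embedding`, the helper `knn_body`, and the specific k and num_candidates values are all assumptions for illustration. The `knn` section uses the `field`, `query_vector`, `k`, `num_candidates`, and `filter` options, with a `bool` / `must_not` / `ids` filter to drop documents the first query already returned. Whether this ends up faster than one large query is exactly the part I'm not sure about, since a long exclusion list has its own cost.

```python
def knn_body(field, query_vector, k, num_candidates, exclude_ids=None):
    """Build a search body with a top-level knn section.

    exclude_ids (optional) is turned into a must_not ids filter so that
    documents found by an earlier query are skipped by this one.
    """
    knn = {
        "field": field,                      # assumed dense_vector field name
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": num_candidates,
    }
    if exclude_ids:
        knn["filter"] = {
            "bool": {"must_not": [{"ids": {"values": list(exclude_ids)}}]}
        }
    # _source is disabled because only the ids are needed to build the pool
    return {"knn": knn, "size": k, "_source": False}


# Query A: no exclusions
body_a = knn_body("embedding", [0.1, 0.2, 0.3], k=1000, num_candidates=2000)

# Pretend these ids came back from running body_a (hypothetical values)
ids_from_a = ["doc-1", "doc-2"]

# Query B: different vector, skips everything A already returned
body_b = knn_body("embedding", [0.3, 0.2, 0.1], k=1000, num_candidates=2000,
                  exclude_ids=ids_from_a)
```

Each body could then be sent with whatever client you already use; the ids of the hits from A feed the `exclude_ids` of B, and so on for C and D.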