Poor knn results from *some* nodes

For customer requests we currently run two searches: a classic keyword search, and a knn search on a dense vector field. We calculate the query vector outside of ES, and we have our own function to calculate the final score from the two.
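For context, the shape of what we do is roughly the sketch below (using the elasticsearch-py 8.x client; the index name, field names and the 50/50 weighting are placeholders rather than our real values):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

def search_products(keywords: str, query_vector: list[float], k: int = 24):
    # Classic keyword search on the product title.
    keyword_hits = es.search(
        index="products",                      # placeholder index name
        query={"match": {"title": keywords}},
        size=100,
    )["hits"]["hits"]

    # Approximate knn search on the dense vector field.
    knn_hits = es.search(
        index="products",
        knn={
            "field": "title_vector",           # placeholder dense_vector field
            "query_vector": query_vector,      # embedded outside of ES
            "k": k,
            "num_candidates": 5000,
        },
    )["hits"]["hits"]

    # Our own blend of the two scores (illustrative weighting only).
    combined: dict[str, float] = {}
    for hit in keyword_hits:
        combined[hit["_id"]] = 0.5 * hit["_score"]
    for hit in knn_hits:
        combined[hit["_id"]] = combined.get(hit["_id"], 0.0) + 0.5 * hit["_score"]
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]
```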

We witnessed some very strange behaviour with knn today. We have several nodes in our Prod cluster, running ES 8.11.2. We have one shard: the primary is on one node and it is replicated to all the other nodes. There are approximately 150k retail products in our index. Our vectors are embeddings of the product title, which includes the brand name.

When we performed the knn search with the vector for a particular brand name, with k = 24 and num_candidates = 5000, the node with the primary shard and around half of the replica nodes returned good knn results: relevant products from that brand.

However, on the other half or so of the replica nodes, we got NO products from that brand. knn returned some products from a similarly named brand, and some from not-so-similar brands.

Increasing k to larger values didn't improve results on the 'bad' nodes. HOWEVER, increasing num_candidates to 10000 did result in the 'bad' nodes finally returning the most relevant products.

Our working theory is that the HNSW graphs for approximate knn on the 'bad' nodes weren't up-to-date. Does this sound feasible? Can drift occur between the HNSW graph on the primary node and the replicas?

We could see that the replica shard size on the 'bad' nodes was usually around 400 - 500MB smaller than the primary shard / good nodes.

Can anybody help shed any light on what is happening here?

How / when / why does the HNSW graph get updated? Should it get updated on every document (re)index? Or is this a background process?

Does the HNSW graph get created on the primary shard node and then replicated to the other nodes? Or is each node responsible for creating its own HNSW graph as it receives replication data?

Is drift in knn / HNSW between nodes a known issue? Or is what we witnessed unexpected behaviour?

Any / all insights gratefully received!

Hey there @peedeeboy, thanks for the question.

I suspect that the number of segments on these replicas is different, indicating that the HNSW graphs themselves are different. It is expected that the HNSW graphs for replicas may differ because we don't do segment based replication and thus the graphs are re-created on each replica.
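One quick way to check is to compare the shard copies and their segments via the cat APIs - a rough sketch with the Python client and a placeholder index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

# One row per shard copy: compare the store size of the primary vs. each replica.
for shard in es.cat.shards(index="products", format="json", bytes="mb"):
    print(shard["node"], shard["prirep"], shard["store"])

# One row per segment per shard copy: segment counts (and sizes) often differ
# between copies because each copy indexes and merges independently.
segments = es.cat.segments(index="products", format="json")
print(len(segments), "segments across all shard copies")
```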

Some recommendations:

  • Use preference to route users so they consistently hit the same shard copies, as described in getting consistent scoring.
  • Experiment with different values for the knn search and the dense_vector mapping. For example, you can use the ef_construction parameter (under index_options in the dense_vector mapping) so that more candidates are considered when building the graph, then force_merge so it applies to the existing dense vector data. On the search side, you can also increase num_candidates and k, as you already experimented with. (A quick sketch of the preference and num_candidates options follows below.)
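A rough sketch of those two search-side options (placeholder index and field names; the preference value would be whatever identifies a user or session):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

def knn_search(query_vector, session_id, k=24, num_candidates=5000):
    return es.search(
        index="products",                      # placeholder index name
        preference=session_id,                 # same string -> same shard copies, so consistent results
        knn={
            "field": "title_vector",           # placeholder dense_vector field
            "query_vector": query_vector,
            "k": k,                            # these two can be increased to trade
            "num_candidates": num_candidates,  # latency for better recall
        },
    )
```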

Hi @Kathleen_DeRusso ! :wave: Thanks so much for your reply!

> I suspect that the number of segments on these replicas is different, indicating that the HNSW graphs themselves are different. It is expected that the HNSW graphs for replicas may differ because we don't do segment based replication and thus the graphs are re-created on each replica.

I think this is the response we were dreading :sob: We were hoping this was just a freak one-off

Thanks for your suggestions on how we might improve / work around this:

  • Use preference for routing users - I don't think this helps us as it only ensures we will be consistently serving some customers bad results :cry:

  • force_merge - we previously looked into performing a force merge after our big morning reindex or after an index rebuild. (We had noticed that, following an index rebuild, knn performed noticeably worse for a while, and reasoned that this lasted until segments had been merged; the situation improved drastically once search_workers were introduced for parallel knn search.) However, Elastic's docs recommend only force merging a read-only index, and because we are a busy retail site with a constant trickle of updates as products go in and out of stock, it didn't seem feasible to force_merge as standard practice (the one-off call we had in mind is sketched after this list)

  • ef_construction - we can take a look into this and see if it helps. The ES docs don't seem to specify a max value - is there any further reading on this?

  • increase num_candidates - running at num_candidates = 10k did work as a sticking plaster over the issue. The thing we're a bit confused about is that we noticed that in ES 8.13.2 the default num_candidates is now 1.5 * k, which makes us think we shouldn't need to be running num_candidates = 5k or 10k. Again, any deeper reading on what this actually does under the hood would be much appreciated.
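For reference, the one-off call we had considered after the morning reindex was essentially just a force merge - something like this sketch (placeholder index name, and max_num_segments = 1 is illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

# One-off force merge after a bulk reindex; merging down to a single segment
# rebuilds one combined HNSW graph for the whole shard copy. Only sensible on
# an index that is not receiving a constant stream of updates.
es.indices.forcemerge(index="products", max_num_segments=1)
```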

Hey @peedeeboy ,

Yes, you're right - preference will ensure consistency, but if the results are truly poor then it's not a sufficient workaround for your use case.

RE: ef_construction, the maximum possible value is 10,000. However, you could start smaller - try bumping it up to 200 first and see if that helps. Note that higher values of ef_construction will result in longer graph build times, as the trade-off for increased graph quality and consistency.
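For reference, ef_construction is set under index_options in the dense_vector mapping, so something like the sketch below (dims, similarity, and the index/field names are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

# ef_construction lives under index_options in the dense_vector mapping;
# dims, similarity and the index/field names are placeholders.
es.indices.create(
    index="products_v2",
    mappings={
        "properties": {
            "title_vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,                 # default graph connectivity
                    "ef_construction": 200,  # up from the default of 100
                },
            }
        }
    },
)
```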

I agree that you don't want to force merge as standard practice. However, if you change ef_construction, for example, the HNSW graph will not be re-created automatically. To re-create the HNSW graph you can do a one-time force merge, or create a new index and reindex into it. If the issue is on the replicas only and not the primary, you could also try re-creating the replicas (e.g. remove a replica and re-create it). Perhaps you could look at your segment configuration as well, to merge more frequently?
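Rough sketches of those two options (index names and the replica count are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

# Option 1: reindex into a new index created with the updated mapping
# (e.g. the higher ef_construction above), then point your application
# (or an alias) at the new index.
es.reindex(
    source={"index": "products"},
    dest={"index": "products_v2"},
    wait_for_completion=False,   # run as a background task for a large index
)

# Option 2: rebuild the replicas only - drop them and add them back, so each
# replica copy is re-created (recovered from the primary's current segments).
es.indices.put_settings(index="products", settings={"index": {"number_of_replicas": 0}})
es.indices.put_settings(index="products", settings={"index": {"number_of_replicas": 3}})  # your original count
```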

The fact that you had to increase num_candidates to get reasonable results on the replica isn't ideal. I don't think you should have to go that high for k and num_candidates in general, but since the underlying data structure is different it might be something that helps in this particular case.

I hope that helps!

Thanks @Kathleen_DeRusso ! That is super useful. :+1:

It's an odd one - we quite happily accept that this is approximate knn, and that there may be some slight variation between nodes. Previously we've seen maybe one or two products of variation (but still relevant products), or results in a slightly different order. And that is perfectly understandable and fine.

In nearly a year of using knn (and being pretty happy with it) this is the first time we've noticed entire knn result sets not being particularly relevant. It does seem like some of our replica nodes were well behind the primary in rebuilding their HNSW graphs because....... reasons? After this morning's reindex, the shard sizes were all back to being much more similar, and this troublesome query was returning good results across all nodes :person_shrugging:

We'll take your advice and look into ef_construction, and try configuring ES to be more aggressive with segment merges (it sounds like this is what causes the HNSW graphs to be recalculated?). We're already following the best-practice advice for bulk indexing when we rebuild the index, so we'll have a look at what else we might do... increase the number of segment merge threads?

Thanks again for your help! Much appreciated!

I'm glad to hear that things are better! If you have CPU resources to spare (and you likely do) then increasing the number of segment merge threads may help here.
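Assuming it's the merge scheduler's thread count you want to raise, that's an index-level setting along these lines (placeholder index name and value; only worth raising if you genuinely have spare CPU):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

# Allow more concurrent merges per shard; the default is derived from the
# number of available processors, so only raise it if you have CPU to spare.
es.indices.put_settings(
    index="products",   # placeholder index name
    settings={"index": {"merge": {"scheduler": {"max_thread_count": 4}}}},
)
```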

It does sound like an odd problem directly related to rebuilding the HNSW graph on the specific replica. I'm interested in whether this is a one-off issue or something that recurs.

@Kathleen_DeRusso thanks again for all your time!

We are going to find time over the next couple of weeks to write scripts to run n thousand of our most popular searches against each node in our cluster and look for any drastic differences in the knn results...
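The rough plan is something like the sketch below: run each query once per node via the _only_nodes preference and diff the returned product ids (index/field names are placeholders, and the real script would load the queries and vectors from our logs):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://es.example.internal:9200", api_key="...")

def knn_ids_per_node(query_vector, k=24, num_candidates=5000):
    """Run the same knn query against each node's shard copy and return the ids."""
    results = {}
    for node_id, node in es.nodes.info()["nodes"].items():
        hits = es.search(
            index="products",                     # placeholder index name
            preference=f"_only_nodes:{node_id}",  # pin the search to this node's copy
            knn={
                "field": "title_vector",          # placeholder dense_vector field
                "query_vector": query_vector,
                "k": k,
                "num_candidates": num_candidates,
            },
        )["hits"]["hits"]
        results[node["name"]] = [hit["_id"] for hit in hits]
    return results

# Then flag any query where a node's result set differs drastically from the primary's.
```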

Our ES cluster is scaled for 'peak traffic and then some', so there is no CPU / memory pressure on normal trading days, which is why it's weird that some replica nodes didn't keep their HNSW graphs up-to-date.

I'll post back here if we find out anything you might be interested in :+1: