For customer requests we currently run two searches: a classic keyword search, and a knn search on a dense vector field. We calculate the query vector outside of ES, and we have our own function to calculate the final score from the two result sets.
We witnessed some very strange behaviour with knn today. We have several nodes in our Prod cluster, running ES 8.11.2, and a single-shard index: the primary is on one node and a replica is on every other node. There are approximately 150k retail products in the index, and our vectors are embeddings of the product title, which includes the brand name.
Performing the knn search with the vector for a particular brand name, with k = 24 and num_candidates = 5000, the node holding the primary shard and around half of the replica nodes returned good knn results: relevant products from that brand.
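For reference, the request looks roughly like this (index name, field names and the example vector are placeholders, not our real mapping; the real query vector is computed outside ES and has the model's full dimensionality):

```
# knn search - query vector computed outside ES, truncated here
POST /products/_search
{
  "knn": {
    "field": "title_vector",
    "query_vector": [0.018, -0.074, 0.112],
    "k": 24,
    "num_candidates": 5000
  }
}

# separate classic keyword search; we merge and score the two result sets client-side
POST /products/_search
{
  "query": {
    "match": { "title": "SomeBrand product name" }
  }
}
```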
However, on roughly the other half of the replica nodes, we got NO products from that brand: knn returned some products from a similarly named brand, and some from not-so-similar brands.
Increasing k to larger values didn't improve results on the 'bad' nodes. HOWEVER, increasing num_candidates to 10000 did result in the 'bad' nodes finally returning the most relevant products.
Our working theory is that the HNSW graphs for approximate knn on the 'bad' nodes weren't up to date. Does this sound feasible? Can drift occur between the HNSW graph on the primary and the graphs on the replicas?
We could also see that the replica shard on the 'bad' nodes was usually around 400-500 MB smaller than the primary shard and the replicas on the 'good' nodes.
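(For anyone wanting to compare, per-node shard and segment sizes can be checked with the standard _cat APIs, something like the following; the index name is a placeholder:)

```
# store size and doc counts for each copy of the shard
GET _cat/shards/products?v&h=index,shard,prirep,state,docs,store,node

# per-segment breakdown, to see whether replicas carry different segments
GET _cat/segments/products?v
```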
Can anybody help shed any light on what is happening here?
How / when / why does the HNSW graph get updated? Should it get updated on every document (re)index? Or is this a background process?
Does the HNSW graph get created on the primary shard node and then replicated to the other nodes? Or is each node responsible for creating its own HNSW graph as it receives replication data?
Is drift in knn / HNSW between nodes a known issue? Or is what we witnessed unexpected behaviour?
Any / all insights gratefully received!