Adaptive Replica Selection and knn load balancing

hey friends :wave:

I posted previously about our efforts on optimising our dedicated knn cluster.

The latest thing we've been trying to understand / solve is why, when we run load/stress tests, we often see 1 or 2 nodes serving the bulk of requests (and slowing down / queuing search requests at 80-100% CPU usage) whilst the other nodes are chilling at 20% CPU usage with fewer requests.

We disabled Adaptive Replica Selection, which seems to be having the desired effect: load is consistently distributed across all nodes, resulting in a happy cluster :slight_smile:
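
For anyone wanting to try the same thing, here's a rough sketch of how the setting can be toggled via the cluster settings API. It assumes Elasticsearch, the `cluster.routing.use_adaptive_replica_selection` setting, and a cluster reachable at a hypothetical `http://localhost:9200` without auth, so adjust for your own setup. It only builds the request; the actual send is left commented out.

```python
import json
import urllib.request

# Assumption: hypothetical local cluster URL with no authentication.
ES_URL = "http://localhost:9200"


def ars_settings_body(enabled: bool) -> dict:
    """Build the cluster settings body that toggles Adaptive Replica Selection."""
    return {
        "persistent": {
            "cluster.routing.use_adaptive_replica_selection": enabled
        }
    }


def ars_settings_request(base_url: str, enabled: bool) -> urllib.request.Request:
    """Prepare (but do not send) a PUT to the cluster settings endpoint."""
    data = json.dumps(ars_settings_body(enabled)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=data,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = ars_settings_request(ES_URL, enabled=False)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the change, uncomment the lines below:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Passing `enabled=True` builds the request that switches it back on, which is handy for A/B runs of the same load test.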

What we don't understand is WHY we see this skewed load with the default Adaptive Replica Selection switched on? :thinking:

We have a very simple setup: ~120k docs in an index consisting of just a keyword (id) field and a dense vector field, with a single shard (one primary) and a replica on each of the other nodes.
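
To make that concrete, this is roughly what an index like ours could look like, sketched as a Python dict you could send as the body of a create-index request. It is illustrative only: the vector field name `embedding`, the `dims` value, the `cosine` similarity and the use of `auto_expand_replicas` are all assumptions for the sketch, not our exact config.

```python
import json


def knn_index_body(dims: int, vector_field: str = "embedding") -> dict:
    """Index settings and mappings resembling the setup described above."""
    return {
        "settings": {
            "number_of_shards": 1,
            # Assumption: one way to get a replica onto every other node.
            "auto_expand_replicas": "0-all",
        },
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                vector_field: {  # hypothetical field name
                    "type": "dense_vector",
                    "dims": dims,  # hypothetical; not stated in this post
                    "index": True,  # indexed for approximate knn search
                    "similarity": "cosine",  # assumption
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(knn_index_body(dims=384), indent=2))
```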

We've never seen a problem with load distribution on our main traditional keyword cluster - seems to be an issue specifically with vector / approximate knn searching? :thinking: And we're baffled!

Any ideas what might be going on? We're all out of ideas! :rofl:
