Unbalanced CPU load when enabling vector search

Hello everyone,

I have been trying to run vector search at scale, but my cluster ends up in a very awkward, unstable state. I have browsed this forum but did not find anyone reporting a similar problem. Any help would be appreciated, thanks.

I have the following setup:

  • 90 nodes
  • Each node has 32 vCPU and 64 GB RAM (AWS c6g.8xlarge)
  • Total index size is ~300 GB at the moment (roughly 67M documents). It is not the only index on the cluster, but it is the only one with dense vectors.
  • 18 primaries, 5 replicas, distributed evenly (1 shard per node)
  • Each document includes a dense vector field used for vector search. The dimension of the field is 128.
  • We continuously index, delete and upsert documents (~1M docs a day).
  • We have peaks of 500 queries per second on dense vectors (5K queries per second across all indices)
  • Elasticsearch version 8.4.1
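
For a sense of scale, the raw vector payload implied by these numbers can be estimated with a few lines of arithmetic. This is only a rough sketch: it assumes 4-byte float vector elements and ignores the HNSW graph, deleted documents and replicas, so the real footprint is larger.

```python
# Back-of-the-envelope sizing from the figures in the list above.
NUM_DOCS = 67_000_000     # "roughly 67M documents"
DIMS = 128                # dense vector dimension
NUM_PRIMARIES = 18        # primary shards
BYTES_PER_VALUE = 4       # ASSUMPTION: each vector element is a 4-byte float


def raw_vector_gb(num_docs: int, dims: int, bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Raw vector payload in GB (10^9 bytes) for one copy of the index.

    Excludes the HNSW graph, deleted documents and replica copies.
    """
    return num_docs * dims * bytes_per_value / 1e9


total_gb = raw_vector_gb(NUM_DOCS, DIMS)
per_primary_gb = total_gb / NUM_PRIMARIES

print(f"raw vectors, one copy of the index: {total_gb:.1f} GB")
print(f"raw vectors per primary shard:      {per_primary_gb:.2f} GB")
```

Under these assumptions the vectors alone come to roughly 34 GB per copy of the index, or a little under 2 GB per primary shard.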

When there is no vector field, everything works fine.

However, when we enable vector search, everything is fine at first (CPU usage goes up as expected), but after several days:

  • Load becomes unbalanced
  • We observe 2 node populations:
    • Some nodes with a CPU load of 35% (normally loaded)
    • Some nodes with a CPU load above 80% (heavily loaded)
  • As time passes, more nodes tend to join the heavily loaded group
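
The two populations can be tracked over time by polling the `_cat/nodes` API and splitting nodes on a CPU threshold. The following is a minimal sketch, not a monitoring solution: the 80% threshold comes from the figures above, while the base URL, the sample rows and the helper names are hypothetical, and authentication/TLS are left out.

```python
import json
import urllib.request

HEAVY_CPU_THRESHOLD = 80  # percent; taken from the "above 80%" group described above


def split_by_cpu(rows, threshold=HEAVY_CPU_THRESHOLD):
    """Split _cat/nodes rows into (normal, heavy) lists of node names."""
    normal, heavy = [], []
    for row in rows:
        # The _cat APIs return numeric columns as strings in JSON output.
        cpu = int(row["cpu"])
        (heavy if cpu > threshold else normal).append(row["name"])
    return sorted(normal), sorted(heavy)


def fetch_cat_nodes(base_url):
    """Fetch name and CPU columns from GET /_cat/nodes (no auth/TLS handling here)."""
    url = f"{base_url}/_cat/nodes?h=name,cpu,load_1m&format=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Hypothetical sample rows in the shape returned by _cat/nodes with format=json.
sample_rows = [
    {"name": "node-01", "cpu": "35", "load_1m": "9.80"},
    {"name": "node-02", "cpu": "82", "load_1m": "27.10"},
    {"name": "node-03", "cpu": "33", "load_1m": "9.10"},
    {"name": "node-04", "cpu": "91", "load_1m": "30.40"},
]

normal_nodes, heavy_nodes = split_by_cpu(sample_rows)
print("normal:", normal_nodes)
print("heavy: ", heavy_nodes)
```

Against a live cluster, `split_by_cpu(fetch_cat_nodes("http://localhost:9200"))` would give the same split; logging it periodically shows when each node moves from one group to the other.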

We have already checked that:

  • Resetting the cache does nothing
  • Rebooting a heavily loaded node puts it back in a normal state (but another node can then shift from normal load to heavy load)
  • The problem remains after peak hours (i.e. with low traffic)
  • There are no outliers in RAM usage, disk usage, number of deleted documents or number of Lucene segments per shard.
  • We upgraded from c5g to c6g to get more RAM: previously, when vector search was enabled, RAM was maxed out; with c6g this is no longer the case (usage still oscillates between 45 GB and 64 GB).
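
A further check that could complement the list above is comparing the hot threads of a heavily loaded node with those of a normally loaded one, using the nodes hot threads API. The sketch below only builds the request URL; the node names and base URL are hypothetical, and sending the request (with whatever authentication the cluster requires) is left out.

```python
from urllib.parse import quote, urlencode


def hot_threads_url(base_url, node_names, threads=5, interval="500ms"):
    """Build a GET /_nodes/<nodes>/hot_threads URL for the given node names."""
    # Node names are comma-separated in the path; escape each one individually.
    nodes = ",".join(quote(name, safe="") for name in node_names)
    query = urlencode({"threads": threads, "interval": interval, "type": "cpu"})
    return f"{base_url}/_nodes/{nodes}/hot_threads?{query}"


# Hypothetical node names: one heavily loaded node and one normally loaded node.
url = hot_threads_url("http://localhost:9200", ["node-02", "node-01"])
print(url)
```

The response is plain text, so saving the output for both nodes and reading them side by side is usually enough to see which thread pools account for the extra CPU.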

Any idea where this unbalanced load could come from, and how to solve it?

Thanks a lot.
