Explanation of DiskBBQ parameters

Hi, I am looking into using DiskBBQ. I’ve read this blog post, and I noticed there are multiple parameters that can be defined, but there isn’t a thorough explanation of them:

  • cluster size - with a default value of 384
  • visit percentage - with a default value of 0

I’d like a deeper explanation about both and how exactly the vector search works with them.

If it’s relevant, I have indices that each have one shard containing 4.5 million CLIP vectors.

The BBQ (Better Binary Quantization) docs reference DiskBBQ and its configuration: visit_percentage among the query params, and cluster_size among the index params.
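
To make the index-time half concrete, creating a DiskBBQ index looks roughly like this (a sketch using the Python client; the endpoint, index name, field name, dims, and similarity are placeholders, and the exact index_options shape is my reading of the docs, so double-check it against your version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Sketch only: cluster_size is set at index time via index_options.
es.indices.create(
    index="clip-images",  # placeholder index name
    mappings={
        "properties": {
            "image_vector": {
                "type": "dense_vector",
                "dims": 512,              # e.g. a 512-dim CLIP model
                "index": True,
                "similarity": "cosine",
                "index_options": {
                    "type": "bbq_disk",   # the DiskBBQ index type, per the docs
                    "cluster_size": 384,  # default; smaller => more, finer clusters
                },
            }
        }
    },
)
```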

In addition to the docs, here’s an overview. cluster_size controls how many groupings of vectors are created in total: the smaller the cluster size, the more clusters must be searched, which increases accuracy at the cost of performance. The default of 384 is something we arrived at experimentally as a good compromise across a variety of datasets. Honestly, though, I’m not sure how well it will perform for your CLIP vector dataset; I’d be curious to learn how it performs for you.

visit_percentage is the percentage of candidate vectors that are collected within the DiskBBQ index structure before stopping, and it is the most fine-grained way to tweak how much of the index is explored. The default value of 0 just means it isn’t set explicitly: in that case we derive a reasonable visit_percentage from k or num_candidates. In general I would start with something like 1% here and gradually increase the value to increase accuracy; this worked well for us in initial experiments.
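
At query time, setting it explicitly would look something like this (again a sketch: I’m assuming visit_percentage sits alongside the other knn options, and the index and field names carry over from the mapping sketch above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
query_vector = [0.0] * 512  # stand-in for a real CLIP embedding

# Sketch only: visit_percentage is a query-time knob next to k/num_candidates.
resp = es.search(
    index="clip-images",
    knn={
        "field": "image_vector",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
        "visit_percentage": 1.0,  # start around 1% and raise it for accuracy
    },
)
```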

You’ll likely want to oversample a good bit as well. Because BBQ reduces each dimension of a vector down to a single bit, some accuracy is lost in individual comparisons, but the performance gains more than make up for that loss: in our experiments, oversampling a larger number of candidates and rescoring them more than compensated for the effect of quantization on accuracy.
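
In practice that means pairing the quantized search with full-precision rescoring, e.g. via rescore_vector/oversample (a sketch; check that this option is available and spelled this way in your version):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
query_vector = [0.0] * 512  # stand-in for a real CLIP embedding

# Sketch only: gather 3x the needed candidates with quantized comparisons,
# then rescore the survivors with full-precision vectors and keep the top k.
resp = es.search(
    index="clip-images",
    knn={
        "field": "image_vector",
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
        "rescore_vector": {"oversample": 3.0},
    },
)
```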

This blog on Hierarchical K-means talks a good bit about how we experimentally arrived at 384 for the default cluster_size.

Lastly, to give you a little more sense of what DiskBBQ is doing: we partition the space of all vectors into these clusters, or really partitions. We then compute centroids of those partitions and search the centroids first to discover which partitions are worth evaluating at search time. The closest partitions are explored first until we reach some stopping criterion, such as meeting the visit_percentage. Maybe that high-level summary will give you a bit of intuition here.
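
In rough Python pseudocode, the flow is something like this (purely illustrative, not the actual implementation; the distance function, partition layout, and budget math are all simplifications):

```python
import heapq

def distance(a, b):
    # Placeholder metric; the real engine compares quantized representations.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def diskbbq_search_sketch(query, centroids, partitions, k, visit_percentage):
    """Illustrative only: closest partitions first, stop at the visit budget."""
    total = sum(len(p) for p in partitions)
    budget = max(k, int(total * visit_percentage / 100.0))

    # 1. Rank partitions by how close their centroid is to the query.
    order = sorted(range(len(centroids)),
                   key=lambda i: distance(query, centroids[i]))

    # 2. Explore partitions in that order, collecting candidates,
    #    until the stopping criterion (the visit budget) is met.
    candidates, visited = [], 0
    for i in order:
        for vec_id, vec in partitions[i]:
            candidates.append((distance(query, vec), vec_id))
            visited += 1
        if visited >= budget:
            break

    # 3. Return the k best candidates actually visited.
    return heapq.nsmallest(k, candidates)
```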

Feel free to ask additional questions.