Question about Optimized Scalar Quantization

Hi, Good day,

We are planning to use Elasticsearch for vector search in our application, and we would like to check whether it is feasible to remove the storage of the original floating point vectors from Lucene (which Elasticsearch uses internally), mainly because the storage size is too large for us.

I have read a few blog posts and tutorials on Elasticsearch Labs, and my basic understanding is that there are two main scalar quantization methods/approaches in Lucene, represented by [1,2] and [3] respectively. To make it easy to distinguish them, I will use OSQ to mean the quantization method in [3], and Non-OSQ to mean the one in [1,2].

My understanding is that both OSQ and Non-OSQ currently require the storage of the original floating point vectors.

  • OSQ needs to compute the centroid $m$ based on the original floating point vectors (see the sketch after this list).
    • Is $m$ also computed segment-wise? If yes, does that mean a new centroid needs to be re-computed during segment merging?
  • Non-OSQ might need to re-compute the quantization interval during segment merging, and maybe even re-quantize, which requires the original floating point vectors.
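
For context, here is a minimal sketch of my mental model (not Lucene's actual code): quantization first centers each vector on the segment centroid $m$, so $m$ has to exist before any codes can be produced. The `quantize_segment` helper and the min/max interval below are my own simplifications; real OSQ optimizes the per-vector interval rather than using the raw min/max.

```python
import numpy as np

def quantize_segment(vectors: np.ndarray, n_bits: int = 7):
    """Center vectors on the segment centroid m, then scalar-quantize each one.

    Illustrative helper only; the real method chooses the interval [a, b]
    by optimization instead of the raw min/max used here.
    """
    m = vectors.mean(axis=0)                  # segment-wise centroid
    centered = vectors - m
    levels = (1 << n_bits) - 1                # 2^n - 1 quantization steps
    a = centered.min(axis=1, keepdims=True)   # per-vector lower interval bound
    b = centered.max(axis=1, keepdims=True)   # per-vector upper interval bound
    step = (b - a) / levels
    codes = np.round((centered - a) / step).astype(np.int64)
    return m, codes, a, step

# Example: four 3-dimensional vectors quantized to 7-bit codes.
vectors = np.array([[0.1, 0.9, 0.3],
                    [0.2, 0.8, 0.4],
                    [0.9, 0.1, 0.5],
                    [0.8, 0.2, 0.6]])
m, codes, a, step = quantize_segment(vectors)
print(m)      # the centroid the codes depend on
print(codes)  # integer codes in [0, 2^n - 1]
```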

From reading [3], it is encouraging that you mentioned:

We plan to roll out 2, 4 and 7 bit quantization schemes and indeed unify all our index compression options to use this same underlying technique. Furthermore, we anticipate that with 7 bit quantization we should be able to discard the raw floating point vectors and plan to evaluate this thoroughly.

This seems to be what we need in our case, and we understand that you may have set priorities for the unification and evaluation above. May I ask whether you already have a rough schedule for when you plan to do both of them?

Thanks a lot and look forward to your reply.

[1] Understanding scalar quantization in Lucene - Elasticsearch Labs
[2] Scalar quantization optimized for vector databases - Elasticsearch Labs
[3] Understanding optimized scalar quantization - Elasticsearch Labs

@BenTrent I am continuing our previous communication here so we have a formal topic defined.

We have no schedule on when this will be implemented. It has been deprioritized due to other work currently.

@BenTrent thanks for the information w.r.t. the schedule.

May I ask whether my questions about $m$ in the description are on the right track? Thanks.

  • Is $m$ also computed segment-wise? If yes, does that mean a new centroid needs to be re-computed during segment merging?

Correct, it's per segment and calculated on merge. But on merge, we can simply combine the already calculated segment centroids and compute a weighted average.
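
In other words, something like the following sketch (illustrative only, not the actual Lucene merge code):

```python
import numpy as np

def merged_centroid(segment_centroids, segment_sizes):
    """Combine per-segment centroids by vector count instead of re-reading raw vectors."""
    centroids = np.asarray(segment_centroids, dtype=np.float64)
    sizes = np.asarray(segment_sizes, dtype=np.float64)
    return (centroids * sizes[:, None]).sum(axis=0) / sizes.sum()

# Two segments with 1000 and 3000 vectors: the merged centroid is pulled
# toward the larger segment's centroid.
print(merged_centroid([[0.0, 1.0], [1.0, 0.0]], [1000, 3000]))  # -> [0.75 0.25]
```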

@BenTrent thanks for the info.

Okay, does that mean Elasticsearch/Lucene only re-quantizes, based on the raw floating point vectors, those merging segments whose original $m$ has drifted too far from the newly computed $m$ over all merging segments?

@BenTrent may I get your answer for the above question? Thanks a lot.

This is not how BBQ is done today, but this is definitely an optimization we want to apply eventually.

This is how we handle it for our older int4 and int8 quantization indices.

Hi @BenTrent, thanks for your reply.

This is not how BBQ is done today, but this is definitely an optimization we want to apply eventually.

Do you mean that the current BBQ using OSQ does not re-quantize on segment merges?

@BenTrent may I get your input here? Thanks, and sorry for the frequent questions.

Ben was referring to the fact that your suggestion to detect whether $m$ has drifted too far is not an optimization we have built yet, but it is one we have already considered.

During a merge we do re-calculate the centroid that is needed for quantization and write those new vectors out. And to your earlier point about optimization: right now we do this every time, but we want to optimize this further in the future.
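
A rough sketch of that flow (my simplification with a min/max interval and an assumed `merge_and_requantize` helper, not the actual merge code):

```python
import numpy as np

def merge_and_requantize(segments, n_bits=7):
    """segments: list of (raw_vectors, centroid) pairs for the segments being merged."""
    sizes = np.array([len(raw) for raw, _ in segments], dtype=np.float64)
    centroids = np.array([c for _, c in segments], dtype=np.float64)
    # weighted average of the per-segment centroids (as discussed above)
    new_m = (centroids * sizes[:, None]).sum(axis=0) / sizes.sum()

    levels = (1 << n_bits) - 1
    merged_codes = []
    for raw, _ in segments:
        centered = raw - new_m                      # re-center on the merged centroid
        a = centered.min(axis=1, keepdims=True)
        b = centered.max(axis=1, keepdims=True)
        step = np.maximum((b - a) / levels, 1e-12)  # guard against degenerate vectors
        merged_codes.append(np.round((centered - a) / step).astype(np.int64))
    return new_m, np.vstack(merged_codes)           # new centroid + re-quantized codes

# Example: merge two tiny segments whose centroids were computed earlier.
seg_a = (np.array([[0.1, 0.9], [0.2, 0.8]]), np.array([0.15, 0.85]))
seg_b = (np.array([[0.9, 0.1]]), np.array([0.9, 0.1]))
print(merge_and_requantize([seg_a, seg_b]))
```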

Feel free to ask additional questions. Happy to talk this through with you more. Sorry I missed this thread earlier. It's a fun topic for me.

Hi @john-wagster, thanks for your answers.

Just to rephrase your info: during a segment merge, the new centroid is calculated as a weighted average of the centroids from each segment, and then re-quantization is always done in the last step(s) of the merge based on the new centroid and the stored original floating point vectors, right?

during a segment merge, the new centroid is calculated as a weighted average of the centroids from each segment, and then re-quantization is always done in the last step(s) of the merge based on the new centroid and the stored original floating point vectors, right?

Correct, you got it!

There are a few edge cases where we will rebuild the new centroid from the original floating point vectors, such as when there are deleted docs. But typically what you described is exactly what happens.

You can review it in the code yourself here:

@john-wagster thanks a lot for the confirmation and for providing more detailed information.

No problem let me know if you have other questions!

@john-wagster thanks, I would like to clarify one point in [1], about the following highlighted part.

Instead of

the set of all possible quantized vectors

Shouldn't it be

the set of all possible quantized approximation of the original floating point vectors

?

Thanks.

[1] Understanding optimized scalar quantization - Elasticsearch Labs

the set of all possible quantized vectors

the set of all possible quantized approximation of the original floating point vectors

I think a quantized vector is an approximation of the original floating point vector, so to me this reads ok. It's basically a shorthand, and I believe the two statements say the same thing.

Feel free to dig in here if you think I'm missing some subtlety.

Hi @john-wagster, sorry that my rephrasing was not very clear either. By quantized vectors, I mean the vectors that only contain the values after quantization. By quantized approximation of the original floating point vectors, I meant the float-valued vectors shown in the following formula.

The quantized vectors contain values from 0 to 2^n - 1, depending on the number of bits n used for quantization. But the value range [a, b] could be at a different scale than [0, 2^n - 1], and is related to the values shown in the above screenshot. I hope this explains my thought a bit more clearly.

Ah, I think I better understand your question now. The passage that references quantized vectors is referring to those which are quantized but without any corrections applied.

The prior passage that references the quantized approximation of the original floating point vectors is a reference to the quantized vectors after corrections have been applied. As the passage mentioned, we never directly work with that quantity. Instead we compute the estimated distances between the quantized vectors and then apply corrections to recover the approximate distances between the original float vectors.
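
As a toy illustration (using a simplified min/max quantizer rather than the article's exact correction terms), the integer codes plus a handful of stored per-vector scalars are enough to estimate the float dot product:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_bits = 16, 7
levels = (1 << n_bits) - 1
m = rng.normal(size=dim)                     # centroid shared by the segment

def quantize(x):
    c = x - m                                # center on m
    a, b = c.min(), c.max()                  # per-vector interval [a, b]
    s = (b - a) / levels                     # step size
    q = np.round((c - a) / s)                # integer codes in [0, 2^n - 1]
    # per-vector correction terms stored alongside the codes
    return q, a, s, q.sum(), np.dot(x, m)

x, y = rng.normal(size=dim), rng.normal(size=dim)
qx, ax, sx, sumx, xm = quantize(x)
qy, ay, sy, sumy, ym = quantize(y)

# estimate <x - m, y - m> from the codes and the per-vector scalars only
centered_est = ax * ay * dim + ax * sy * sumy + ay * sx * sumx + sx * sy * np.dot(qx, qy)
# apply the corrections to recover an approximate float dot product
dot_est = centered_est + xm + ym - np.dot(m, m)
print(np.dot(x, y), dot_est)                 # the estimate should be close to the true value
```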

I still believe the article is correct though in how it distinguishes those to answer your original question.

Does that help allay any confusion there?

My apologies, it's been a bit since I've read through the article, and I had just forgotten about that prior passage.

Hi @john-wagster, no problem, thanks for your clarification. Are the correction terms used mainly during the dot product calculation? The geometry example, to me, seems more about describing the geometric properties of the vectors.